* Linux 2.6.29
@ 2009-03-23 23:29 Linus Torvalds
  2009-03-24  6:19 ` Jesper Krogh
                   ` (2 more replies)
  0 siblings, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-23 23:29 UTC (permalink / raw)
  To: Linux Kernel Mailing List


It's out there now, or at least in the process of getting mirrored out.

The most obvious change is the (temporary) change of logo to Tuz, the 
Tasmanian Devil. But there's a number of driver updates and some m68k 
header updates (fixing headers_install after the merge of non-MMU/MMU) 
that end up being pretty noticeable in the diffs.

The shortlog (from -rc8, obviously - the full logs from 2.6.28 are too big 
to even contemplate attaching here) is appended, and most of the non-logo 
changes really shouldn't be all that noticeable to most people. Nothing 
really exciting, although I admit to fleetingly considering another -rc 
series just because the changes are bigger than I would have wished for 
this late in the game. But there was little point in holding off the real 
release any longer, I feel.

This obviously starts the merge window for 2.6.30, although as usual, I'll 
probably wait a day or two before I start actively merging. I do that in 
order to hopefully result in people testing the final plain 2.6.29 a bit 
more before all the crazy changes start up again.

		Linus

---
Aaro Koskinen (2):
      ARM: OMAP: sched_clock() corrected
      ARM: OMAP: Allow I2C bus driver to be compiled as a module

Abhijeet Joglekar (2):
      [SCSI] libfc: Pass lport in exch_mgr_reset
      [SCSI] libfc: when rport goes away (re-plogi), clean up exchanges to/from rport

Achilleas Kotsis (1):
      USB: Add device id for Option GTM380 to option driver

Al Viro (1):
      net: fix sctp breakage

Alan Stern (2):
      USB: usbfs: keep async URBs until the device file is closed
      USB: EHCI: expedite unlinks when the root hub is suspended

Albert Pauw (1):
      USB: option.c: add ZTE 622 modem device

Alexander Duyck (1):
      igb: remove ASPM L0s workaround

Andrew Vasquez (4):
      [SCSI] qla2xxx: Correct address range checking for option-rom updates.
      [SCSI] qla2xxx: Correct truncation in return-code status checking.
      [SCSI] qla2xxx: Correct overwrite of pre-assigned init-control-block structure size.
      [SCSI] qla2xxx: Update version number to 8.03.00-k4.

Andy Whitcroft (1):
      suspend: switch the Asus Pundit P1-AH2 to old ACPI sleep ordering

Anirban Chakraborty (1):
      [SCSI] qla2xxx: Correct vport delete bug.

Anton Vorontsov (1):
      ucc_geth: Fix oops when using fixed-link support

Antti Palosaari (1):
      V4L/DVB (10972): zl10353: i2c_gate_ctrl bug fix

Axel Wachtler (1):
      USB: serial: add FTDI USB/Serial converter devices

Ben Dooks (6):
      [ARM] S3C64XX: Set GPIO pin when select IRQ_EINT type
      [ARM] S3C64XX: Rename IRQ_UHOST to IRQ_USBH
      [ARM] S3C64XX: Fix name of USB host clock.
      [ARM] S3C64XX: Fix USB host clock mux list
      [ARM] S3C64XX: sparse warnings in arch/arm/plat-s3c64xx/s3c6400-clock.c
      [ARM] S3C64XX: sparse warnings in arch/arm/plat-s3c64xx/irq.c

Benjamin Herrenschmidt (2):
      emac: Fix clock control for 405EX and 405EXr chips
      radeonfb: Whack the PCI PM register until it sticks

Benny Halevy (1):
      NFSD: provide encode routine for OP_OPENATTR

Bjørn Mork (1):
      ipv6: fix display of local and remote sit endpoints

Borislav Petkov (1):
      ide-floppy: do not map dataless cmds to an sg

Carlos Corbacho (2):
      acpi-wmi: Unmark as 'experimental'
      acer-wmi: Unmark as 'experimental'

Chris Leech (3):
      [SCSI] libfc: rport retry on LS_RJT from certain ELS
      [SCSI] fcoe: fix handling of pending queue, prevent out of order frames (v3)
      ixgbe: fix multiple unicast address support

Chris Mason (2):
      Btrfs: Fix locking around adding new space_info
      Btrfs: Clear space_info full when adding new devices

Christoph Paasch (2):
      netfilter: conntrack: fix dropping packet after l4proto->packet()
      netfilter: conntrack: check for NEXTHDR_NONE before header sanity checking

Chuck Lever (2):
      NLM: Shrink the IPv4-only version of nlm_cmp_addr()
      NLM: Fix GRANT callback address comparison when IPv6 is enabled

Corentin Chary (4):
      asus-laptop: restore acpi_generate_proc_event()
      eeepc-laptop: restore acpi_generate_proc_event()
      asus-laptop: use select instead of depends on
      platform/x86: depends instead of select for laptop platform drivers

Cyrill Gorcunov (1):
      acpi: check for pxm_to_node_map overflow

Daisuke Nishimura (1):
      vmscan: pgmoved should be cleared after updating recent_rotated

Dan Carpenter (1):
      acer-wmi: double free in acer_rfkill_exit()

Dan Williams (1):
      USB: Option: let cdc-acm handle Sony Ericsson F3507g / Dell 5530

Darius Augulis (1):
      MX1 fix include

Dave Jones (1):
      via-velocity: Fix DMA mapping length errors on transmit.

David Brownell (2):
      ARM: OMAP: Fix compile error if pm.h is included
      dm9000: locking bugfix

David S. Miller (3):
      dnet: Fix warnings on 64-bit.
      xfrm: Fix xfrm_state_find() wrt. wildcard source address.
      sparc64: Reschedule KGDB capture to a software interrupt.

Davide Libenzi (1):
      eventfd: remove fput() call from possible IRQ context

Dhananjay Phadke (1):
      netxen: remove old flash check.

Dirk Hohndel (1):
      USB: Add Vendor/Product ID for new CDMA U727 to option driver

Eilon Greenstein (3):
      bnx2x: Adding restriction on sge_buf_size
      bnx2x: Casting page alignment
      bnx2x: Using DMAE to initialize the chip

Enrik Berkhan (1):
      nommu: ramfs: pages allocated to an inode's pagecache may get wrongly discarded

Eric Sandeen (3):
      ext4: fix header check in ext4_ext_search_right() for deep extent trees.
      ext4: fix bogus BUG_ONs in in mballoc code
      ext4: fix bb_prealloc_list corruption due to wrong group locking

FUJITA Tomonori (1):
      ide: save the returned value of dma_map_sg

Geert Uytterhoeven (1):
      ps3/block: Replace mtd/ps3vram by block/ps3vram

Geoff Levand (1):
      powerpc/ps3: ps3_defconfig updates

Gerald Schaefer (1):
      [S390] Dont check for pfn_valid() in uaccess_pt.c

Gertjan van Wingerde (1):
      Update my email address

Grant Grundler (2):
      parisc: fix wrong assumption about bus->self
      parisc: update MAINTAINERS

Grant Likely (1):
      Fix Xilinx SystemACE driver to handle empty CF slot

Greg Kroah-Hartman (3):
      USB: usbtmc: fix stupid bug in open()
      USB: usbtmc: add protocol 1 support
      Staging: benet: remove driver now that it is merged in drivers/net/

Greg Ungerer (8):
      m68k: merge the non-MMU and MMU versions of param.h
      m68k: merge the non-MMU and MMU versions of swab.h
      m68k: merge the non-MMU and MMU versions of sigcontext.h
      m68k: use MMU version of setup.h for both MMU and non-MMU
      m68k: merge the non-MMU and MMU versions of ptrace.h
      m68k: merge the non-MMU and MMU versions of signal.h
      m68k: use the MMU version of unistd.h for all m68k platforms
      m68k: merge the non-MMU and MMU versions of siginfo.h

Gregory Lardiere (1):
      V4L/DVB (10789): m5602-s5k4aa: Split up the initial sensor probe in chunks.

Hans Werner (1):
      V4L/DVB (10977): STB6100 init fix, the call to stb6100_set_bandwidth needs an argument

Hartley Sweeten (1):
      [ARM] 5419/1: ep93xx: fix build warnings about struct i2c_board_info

Heiko Carstens (2):
      [S390] topology: define SD_MC_INIT to fix performance regression
      [S390] ftrace/mcount: fix kernel stack backchain

Helge Deller (7):
      parisc: BUG_ON() cleanup
      parisc: fix section mismatch warnings
      parisc: fix `struct pt_regs' declared inside parameter list warning
      parisc: remove unused local out_putf label
      parisc: fix dev_printk() compile warnings for accessing a device struct
      parisc: add braces around arguments in assembler macros
      parisc: fix 64bit build

Herbert Xu (1):
      gro: Fix legacy path napi_complete crash

Huang Ying (1):
      dm crypt: fix kcryptd_async_done parameter

Ian Dall (1):
      Bug 11061, NFS mounts dropped

Igor M. Liplianin (1):
      V4L/DVB (10976): Bug fix: For legacy applications stv0899 performs search only first time after insmod.

Ilya Yanok (3):
      dnet: Dave DNET ethernet controller driver (updated)
      dnet: replace obsolete *netif_rx_* functions with *napi_*
      dnet: DNET should depend on HAS_IOMEM

Ingo Molnar (1):
      kconfig: improve seed in randconfig

J. Bruce Fields (1):
      nfsd: nfsd should drop CAP_MKNOD for non-root

James Bottomley (1):
      parisc: remove klist iterators

Jan Dumon (1):
      USB: unusual_devs: Add support for GI 0431 SD-Card interface

Jay Vosburgh (1):
      bonding: Fix updating of speed/duplex changes

Jeff Moyer (1):
      aio: lookup_ioctx can return the wrong value when looking up a bogus context

Jiri Slaby (8):
      ACPI: remove doubled status checking
      USB: atm/cxacru, fix lock imbalance
      USB: image/mdc800, fix lock imbalance
      USB: misc/adutux, fix lock imbalance
      USB: misc/vstusb, fix lock imbalance
      USB: wusbcore/wa-xfer, fix lock imbalance
      ALSA: pcm_oss, fix locking typo
      ALSA: mixart, fix lock imbalance

Jody McIntyre (1):
      trivial: fix orphan dates in ext2 documentation

Johannes Weiner (3):
      HID: fix incorrect free in hiddev
      HID: fix waitqueue usage in hiddev
      nommu: ramfs: don't leak pages when adding to page cache fails

John Dykstra (1):
      ipv6:  Fix BUG when disabled ipv6 module is unloaded

John W. Linville (1):
      lib80211: silence excessive crypto debugging messages

Jorge Boncompte [DTI2] (1):
      netns: oops in ip[6]_frag_reasm incrementing stats

Jouni Malinen (3):
      mac80211: Fix panic on fragmentation with power saving
      zd1211rw: Do not panic on device eject when associated
      nl80211: Check that function pointer != NULL before using it

Karsten Wiese (1):
      USB: EHCI: Fix isochronous URB leak

Kay Sievers (1):
      parisc: dino: struct device - replace bus_id with dev_name(), dev_set_name()

Koen Kooi (1):
      ARM: OMAP: board-omap3beagle: set i2c-3 to 100kHz

Krzysztof Helt (1):
      ALSA: opl3sa2 - Fix NULL dereference when suspending snd_opl3sa2

Kumar Gala (2):
      powerpc/mm: Respect _PAGE_COHERENT on classic ppc32 SW
      powerpc/mm: Fix Respect _PAGE_COHERENT on classic ppc32 SW TLB load machines

Kyle McMartin (8):
      parisc: fix use of new cpumask api in irq.c
      parisc: convert (read|write)bwlq to inlines
      parisc: convert cpu_check_affinity to new cpumask api
      parisc: define x->x mmio accessors
      parisc: update defconfigs
      parisc: sba_iommu: fix build bug when CONFIG_PARISC_AGP=y
      tulip: fix crash on iface up with shirq debug
      Build with -fno-dwarf2-cfi-asm

Lalit Chandivade (1):
      [SCSI] qla2xxx: Use correct value for max vport in LOOP topology.

Len Brown (1):
      Revert "ACPI: make some IO ports off-limits to AML"

Lennert Buytenhek (1):
      mv643xx_eth: fix unicast address filter corruption on mtu change

Li Zefan (1):
      block: fix memory leak in bio_clone()

Linus Torvalds (7):
      Fix potential fast PIT TSC calibration startup glitch
      Fast TSC calibration: calculate proper frequency error bounds
      Avoid 64-bit "switch()" statements on 32-bit architectures
      Add '-fwrapv' to gcc CFLAGS
      Fix race in create_empty_buffers() vs __set_page_dirty_buffers()
      Move cc-option to below arch-specific setup
      Linux 2.6.29

Luis R. Rodriguez (2):
      ath9k: implement IO serialization
      ath9k: AR9280 PCI devices must serialize IO as well

Maciej Sosnowski (1):
      dca: add missing copyright/license headers

Manu Abraham (1):
      V4L/DVB (10975): Bug: Use signed types, Offsets and range can be negative

Mark Brown (5):
      [ARM] S3C64XX: Fix section mismatch for s3c64xx_register_clocks()
      [ARM] SMDK6410: Correct I2C device name for WM8580
      [ARM] SMDK6410: Declare iodesc table static
      [ARM] S3C64XX: Staticise s3c64xx_init_irq_eint()
      [ARM] S3C64XX: Do gpiolib configuration earlier

Mark Lord (1):
      sata_mv: fix MSI irq race condition

Martin Schwidefsky (3):
      [S390] __div64_31 broken for CONFIG_MARCH_G5
      [S390] make page table walking more robust
      [S390] make page table upgrade work again

Masami Hiramatsu (2):
      prevent boosting kprobes on exception address
      module: fix refptr allocation and release order

Mathieu Chouquet-Stringer (1):
      thinkpad-acpi: fix module autoloading for older models

Matthew Wilcox (1):
      [SCSI] sd: Don't try to spin up drives that are connected to an inactive port

Matthias Schwarzzot (1):
      V4L/DVB (10978): Report tuning algorith correctly

Mauro Carvalho Chehab (1):
      V4L/DVB (10834): zoran: auto-select bt866 for AverMedia 6 Eyes

Michael Chan (1):
      bnx2: Fix problem of using wrong IRQ handler.

Michael Hennerich (1):
      USB: serial: ftdi: enable UART detection on gnICE JTAG adaptors blacklist interface0

Mike Travis (1):
      parisc: update parisc for new irq_desc

Miklos Szeredi (1):
      fix ptrace slowness

Mikulas Patocka (3):
      dm table: rework reference counting fix
      dm io: respect BIO_MAX_PAGES limit
      sparc64: Fix crash with /proc/iomem

Milan Broz (2):
      dm ioctl: validate name length when renaming
      dm crypt: wait for endio to complete before destruction

Moritz Muehlenhoff (1):
      USB: Updated unusual-devs entry for USB mass storage on Nokia 6233

Nobuhiro Iwamatsu (2):
      sh_eth: Change handling of IRQ
      sh_eth: Fix mistake of the address of SH7763

Pablo Neira Ayuso (2):
      netfilter: conntrack: don't deliver events for racy packets
      netfilter: ctnetlink: fix crash during expectation creation

Pantelis Koukousoulas (1):
      virtio_net: Make virtio_net support carrier detection

Piotr Ziecik (1):
      powerpc/5200: Enable CPU_FTR_NEED_COHERENT for MPC52xx

Ralf Baechle (1):
      MIPS: Mark Eins: Fix configuration.

Robert Love (11):
      [SCSI] libfc: Don't violate transport template for rogue port creation
      [SCSI] libfc: correct RPORT_TO_PRIV usage
      [SCSI] libfc: rename rp to rdata in fc_disc_new_target()
      [SCSI] libfc: check for err when recv and state is incorrect
      [SCSI] libfc: Cleanup libfc_function_template comments
      [SCSI] libfc, fcoe: Fix kerneldoc comments
      [SCSI] libfc, fcoe: Cleanup function formatting and minor typos
      [SCSI] libfc, fcoe: Remove unnecessary cast by removing inline wrapper
      [SCSI] fcoe: Use setup_timer() and mod_timer()
      [SCSI] fcoe: Correct fcoe_transports initialization vs. registration
      [SCSI] fcoe: Change fcoe receive thread nice value from 19 (lowest priority) to -20

Robert M. Kenney (1):
      USB: serial: new cp2101 device id

Roel Kluin (3):
      [SCSI] fcoe: fix kfree(skb)
      acpi-wmi: unsigned cannot be less than 0
      net: kfree(napi->skb) => kfree_skb

Ron Mercer (4):
      qlge: bugfix: Increase filter on inbound csum.
      qlge: bugfix: Tell hw to strip vlan header.
      qlge: bugfix: Move netif_napi_del() to common call point.
      qlge: bugfix: Pad outbound frames smaller than 60 bytes.

Russell King (2):
      [ARM] update mach-types
      [ARM] Fix virtual to physical translation macro corner cases

Rusty Russell (1):
      linux.conf.au 2009: Tuz

Saeed Bishara (1):
      [ARM] orion5x: pass dram mbus data to xor driver

Sam Ravnborg (1):
      kconfig: fix randconfig for choice blocks

Sathya Perla (3):
      net: Add be2net driver.
      be2net: replenish when posting to rx-queue is starved in out of mem conditions
      be2net: fix to restore vlan ids into BE2 during a IF DOWN->UP cycle

Scott James Remnant (1):
      sbus: Auto-load openprom module when device opened.

Sigmund Augdal (1):
      V4L/DVB (10974): Use Diseqc 3/3 mode to send data

Stanislaw Gruszka (1):
      net: Document /proc/sys/net/core/netdev_budget

Stephen Hemminger (1):
      sungem: missing net_device_ops

Stephen Rothwell (1):
      net: update dnet.c for bus_id removal

Steve Glendinning (1):
      smsc911x: reset last known duplex and carrier on open

Steve Ma (1):
      [SCSI] libfc: exch mgr is freed while lport still retrying sequences

Stuart MENEFY (1):
      libata: Keep shadow last_ctl up to date during resets

Suresh Jayaraman (1):
      NFS: Handle -ESTALE error in access()

Takashi Iwai (3):
      ALSA: hda - Fix DMA mask for ATI controllers
      ALSA: hda - Workaround for buggy DMA position on ATI controllers
      ALSA: Fix vunmap and free order in snd_free_sgbuf_pages()

Tao Ma (2):
      ocfs2: Fix a bug found by sparse check.
      ocfs2: Use xs->bucket to set xattr value outside

Tejun Heo (1):
      ata_piix: add workaround for Samsung DB-P70

Theodore Ts'o (1):
      ext4: Print the find_group_flex() warning only once

Thomas Bartosik (1):
      USB: storage: Unusual USB device Prolific 2507 variation added

Tiger Yang (2):
      ocfs2: reserve xattr block for new directory with inline data
      ocfs2: tweak to get the maximum inline data size with xattr

Tilman Schmidt (1):
      bas_gigaset: correctly allocate USB interrupt transfer buffer

Trond Myklebust (6):
      SUNRPC: Tighten up the task locking rules in __rpc_execute()
      NFS: Fix misparsing of nfsv4 fs_locations attribute (take 2)
      NFSv3: Fix posix ACL code
      SUNRPC: Fix an Oops due to socket not set up yet...
      SUNRPC: xprt_connect() don't abort the task if the transport isn't bound
      NFS: Fix the fix to Bugzilla #11061, when IPv6 isn't defined...

Tyler Hicks (3):
      eCryptfs: don't encrypt file key with filename key
      eCryptfs: Allocate a variable number of pages for file headers
      eCryptfs: NULL crypt_stat dereference during lookup

Uwe Kleine-König (2):
      [ARM] 5418/1: restore lr before leaving mcount
      [ARM] 5421/1: ftrace: fix crash due to tracing of __naked functions

Vasu Dev (5):
      [SCSI] libfc: handle RRQ exch timeout
      [SCSI] libfc: fixed a soft lockup issue in fc_exch_recv_abts
      [SCSI] libfc, fcoe: fixed locking issues with lport->lp_mutex around lport->link_status
      [SCSI] libfc: fixed a read IO data integrity issue when a IO data frame lost
      [SCSI] fcoe: Out of order tx frames was causing several check condition SCSI status

Viral Mehta (1):
      ALSA: oss-mixer - Fixes recording gain control

Vitaly Wool (1):
      V4L/DVB (10832): tvaudio: Avoid breakage with tda9874a

Werner Almesberger (1):
      [ARM] S3C64XX: Fix s3c64xx_setrate_clksrc

Yi Zou (2):
      [SCSI] libfc: do not change the fh_rx_id of a recevied frame
      [SCSI] fcoe: ETH_P_8021Q is already in if_ether and fcoe is not using it anyway

Zhang Le (2):
      MIPS: Fix TIF_32BIT undefined problem when seccomp is disabled
      filp->f_pos not correctly updated in proc_task_readdir

Zhang Rui (1):
      ACPI suspend: Blacklist Toshiba Satellite L300 that requires to set SCI_EN directly on resume

françois romieu (2):
      r8169: use hardware auto-padding.
      r8169: revert "r8169: read MAC address from EEPROM on init (2nd attempt)"

un'ichi Nomura (1):
      block: Add gfp_mask parameter to bio_integrity_clone()

* Re: Linux 2.6.29
  2009-03-23 23:29 Linux 2.6.29 Linus Torvalds
@ 2009-03-24  6:19 ` Jesper Krogh
  2009-03-24  6:46   ` David Rees
  2009-04-02 14:00   ` Mathieu Desnoyers
  2009-03-24 13:02 ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Ingo Molnar
  2009-03-27 13:35 ` Linux 2.6.29 Hans-Peter Jansen
  2 siblings, 2 replies; 664+ messages in thread
From: Jesper Krogh @ 2009-03-24  6:19 UTC (permalink / raw)
  To: Linus Torvalds, Linux Kernel Mailing List

Linus Torvalds wrote:
> This obviously starts the merge window for 2.6.30, although as usual, I'll 
> probably wait a day or two before I start actively merging. I do that in 
> order to hopefully result in people testing the final plain 2.6.29 a bit 
> more before all the crazy changes start up again.

I know this has been discussed before:

[129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480 seconds.
[129402.084667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129402.179331] updatedb.mloc D 0000000000000000     0 31092  31091
[129402.179335]  ffff8805ffa1d900 0000000000000082 ffff8803ff5688a8 0000000000001000
[129402.179338]  ffffffff806cc000 ffffffff806cc000 ffffffff806d3e80 ffffffff806d3e80
[129402.179341]  ffffffff806cfe40 ffffffff806d3e80 ffff8801fb9f87e0 000000000000ffff
[129402.179343] Call Trace:
[129402.179353]  [<ffffffff802d3ff0>] sync_buffer+0x0/0x50
[129402.179358]  [<ffffffff80493a50>] io_schedule+0x20/0x30
[129402.179360]  [<ffffffff802d402b>] sync_buffer+0x3b/0x50
[129402.179362]  [<ffffffff80493d2f>] __wait_on_bit+0x4f/0x80
[129402.179364]  [<ffffffff802d3ff0>] sync_buffer+0x0/0x50
[129402.179366]  [<ffffffff80493dda>] out_of_line_wait_on_bit+0x7a/0xa0
[129402.179369]  [<ffffffff80252730>] wake_bit_function+0x0/0x30
[129402.179396]  [<ffffffffa0264346>] ext3_find_entry+0xf6/0x610 [ext3]
[129402.179399]  [<ffffffff802d3453>] __find_get_block+0x83/0x170
[129402.179403]  [<ffffffff802c4a90>] ifind_fast+0x50/0xa0
[129402.179405]  [<ffffffff802c5874>] iget_locked+0x44/0x180
[129402.179412]  [<ffffffffa0266435>] ext3_lookup+0x55/0x100 [ext3]
[129402.179415]  [<ffffffff802c32a7>] d_alloc+0x127/0x1c0
[129402.179417]  [<ffffffff802ba2a7>] do_lookup+0x1b7/0x250
[129402.179419]  [<ffffffff802bc51d>] __link_path_walk+0x76d/0xd60
[129402.179421]  [<ffffffff802ba17f>] do_lookup+0x8f/0x250
[129402.179424]  [<ffffffff802c8b37>] mntput_no_expire+0x27/0x150
[129402.179426]  [<ffffffff802bcb64>] path_walk+0x54/0xb0
[129402.179428]  [<ffffffff802bfd10>] filldir+0x0/0xf0
[129402.179430]  [<ffffffff802bcc8a>] do_path_lookup+0x7a/0x150
[129402.179432]  [<ffffffff802bbb55>] getname+0xe5/0x1f0
[129402.179434]  [<ffffffff802bd8d4>] user_path_at+0x44/0x80
[129402.179437]  [<ffffffff802b53b5>] cp_new_stat+0xe5/0x100
[129402.179440]  [<ffffffff802b56d0>] vfs_lstat_fd+0x20/0x60
[129402.179442]  [<ffffffff802b5737>] sys_newlstat+0x27/0x50
[129402.179445]  [<ffffffff8020c35b>] system_call_fastpath+0x16/0x1b

Consensus seems to be something with large memory machines, lots of 
dirty pages and a long writeout time due to ext3.

At the moment this is the largest "usability" issue in the server setup 
I'm working with. Can something be done to "autotune" it .. or perhaps 
even fix it? .. or is the only option to switch to xfs or wait for ext4?

Jesper
-- 
Jesper

* Re: Linux 2.6.29
  2009-03-24  6:19 ` Jesper Krogh
@ 2009-03-24  6:46   ` David Rees
  2009-03-24  7:32     ` Jesper Krogh
  2009-03-24  9:15     ` Alan Cox
  2009-04-02 14:00   ` Mathieu Desnoyers
  1 sibling, 2 replies; 664+ messages in thread
From: David Rees @ 2009-03-24  6:46 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Mon, Mar 23, 2009 at 11:19 PM, Jesper Krogh <jesper@krogh.cc> wrote:
> I know this has been discussed before:
>
> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480
> seconds.

Ouch - 480 seconds, how much memory is in that machine, and how slow
are the disks? What's your vm.dirty_background_ratio and
vm.dirty_ratio set to?

> Consensus seems to be something with large memory machines, lots of dirty
> pages and a long writeout time due to ext3.

All filesystems seem to suffer from this issue to some degree.  I
posted to the list earlier trying to see if there was anything that
could be done to help my specific case.  I've got a system where if
someone starts writing out a large file, it kills client NFS writes.
Makes the system unusable:
http://marc.info/?l=linux-kernel&m=123732127919368&w=2

Only workaround I've found is to reduce dirty_background_ratio and
dirty_ratio to tiny levels.  Or throw good SSDs and/or a fast RAID
array at it so that large writes complete faster.  Have you tried the
new vm.dirty_bytes in 2.6.29?
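
Before changing any of these knobs it helps to record what they are set
to. The sketch below is one possible way to do that, assuming the knobs
are exposed as files under /proc/sys/vm (the usual sysctl layout) and
that the byte-based variants may be missing on kernels older than
2.6.29. The function name read_dirty_knobs is hypothetical.

```python
import os

# Writeback knobs named in this thread, plus their byte-based variants.
DIRTY_KNOBS = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_background_bytes",
    "dirty_bytes",
)


def read_dirty_knobs(vm_dir="/proc/sys/vm"):
    """Return {knob: int} for every knob file readable under vm_dir."""
    values = {}
    for name in DIRTY_KNOBS:
        path = os.path.join(vm_dir, name)
        try:
            with open(path) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing or unparsable file: skip it rather than guess.
            continue
    return values
```

Calling read_dirty_knobs() with no arguments reads the live settings;
passing another directory makes it easy to try the function against
saved copies of the files.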

> At the moment this is the largest "usability" issue in the server setup
> I'm working with. Can something be done to "autotune" it .. or perhaps
> even fix it? .. or is the only option to switch to xfs or wait for ext4?

Everyone seems to agree that "autotuning" it is the way to go.  But no
one seems willing to step up and try to do it.  Probably because it's
hard to get right!

-Dave

* Re: Linux 2.6.29
  2009-03-24  6:46   ` David Rees
@ 2009-03-24  7:32     ` Jesper Krogh
  2009-03-24  8:16       ` Ingo Molnar
  2009-03-24 19:00       ` David Rees
  2009-03-24  9:15     ` Alan Cox
  1 sibling, 2 replies; 664+ messages in thread
From: Jesper Krogh @ 2009-03-24  7:32 UTC (permalink / raw)
  To: David Rees; +Cc: Linus Torvalds, Linux Kernel Mailing List

David Rees wrote:
> On Mon, Mar 23, 2009 at 11:19 PM, Jesper Krogh <jesper@krogh.cc> wrote:
>> I know this has been discussed before:
>>
>> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480
>> seconds.
> 
> Ouch - 480 seconds, how much memory is in that machine, and how slow
> are the disks? 

The 480 seconds is not the "wait time" but the time that passes before
the message is printed. It is the kernel default; it was 120 seconds
earlier, but that was changed by Ingo Molnar back in September. I do get
a lot less noise, but it really doesn't tell anything about the nature
of the problem.

The system's spec:
32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in 
RAID10 connected using 4Gbit Fibre Channel. I'll leave it up to you to 
decide if that's fast or slow.

The strange thing is actually that the above process (updatedb.mlocate) 
is writing to / which is a device without any activity at all. All 
activity is on the Fibre Channel device above, but processes writing 
outside that seem to be affected as well.

> What's your vm.dirty_background_ratio and
> vm.dirty_ratio set to?

2.6.29-rc8 defaults:
jk@hest:/proc/sys/vm$ cat dirty_background_ratio
5
jk@hest:/proc/sys/vm$ cat dirty_ratio
10
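
To put those two percentages in perspective on a 32GB machine, the
arithmetic can be sketched as below. This is a rough estimate only: it
assumes the ratios apply to total RAM, whereas the kernel applies them
to a smaller "dirtyable" amount, so the real thresholds are somewhat
lower. The function name dirty_thresholds is made up for illustration.

```python
GIB = 1024 ** 3


def dirty_thresholds(mem_bytes, background_ratio, dirty_ratio):
    """Approximate writeback thresholds, in bytes, from percentage ratios.

    Assumption: the ratios are applied to total RAM, which gives an
    upper bound on what the kernel would actually use.
    """
    background = mem_bytes * background_ratio // 100
    hard_limit = mem_bytes * dirty_ratio // 100
    return background, hard_limit


# 32 GiB of RAM with the ratios shown above (5 and 10).
background, hard_limit = dirty_thresholds(32 * GIB, 5, 10)
print(f"background writeback starts near {background / GIB:.1f} GiB dirty")
print(f"writers are throttled near {hard_limit / GIB:.1f} GiB dirty")
```

With these inputs the estimate is about 1.6 GiB of dirty data before
background writeback starts and about 3.2 GiB before writers are
throttled, which is a lot of data to have queued behind a single
synchronous request.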

>> Consensus seems to be something with large memory machines, lots of dirty
>> pages and a long writeout time due to ext3.
> 
> All filesystems seem to suffer from this issue to some degree.  I
> posted to the list earlier trying to see if there was anything that
> could be done to help my specific case.  I've got a system where if
> someone starts writing out a large file, it kills client NFS writes.
> Makes the system unusable:
> http://marc.info/?l=linux-kernel&m=123732127919368&w=2

Yes, I've hit 120s+ penalties just by saving a file in vim.

> Only workaround I've found is to reduce dirty_background_ratio and
> dirty_ratio to tiny levels.  Or throw good SSDs and/or a fast RAID
> array at it so that large writes complete faster.  Have you tried the
> new vm.dirty_bytes in 2.6.29?

No.. What would you suggest to be a reasonable setting for that?

> Everyone seems to agree that "autotuning" it is the way to go.  But no
> one seems willing to step up and try to do it.  Probably because it's
> hard to get right!

I can test patches.. but I'm not a kernel-developer.. unfortunately.

Jesper

-- 
Jesper

* Re: Linux 2.6.29
  2009-03-24  7:32     ` Jesper Krogh
@ 2009-03-24  8:16       ` Ingo Molnar
  2009-03-24 11:10         ` Jesper Krogh
  2009-03-24 19:00       ` David Rees
  1 sibling, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24  8:16 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: David Rees, Linus Torvalds, Linux Kernel Mailing List


* Jesper Krogh <jesper@krogh.cc> wrote:

> David Rees wrote:
>> On Mon, Mar 23, 2009 at 11:19 PM, Jesper Krogh <jesper@krogh.cc> wrote:
>>> I know this has been discussed before:
>>>
>>> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480
>>> seconds.
>>
>> Ouch - 480 seconds, how much memory is in that machine, and how slow
>> are the disks? 
>
> The 480 seconds is not the "wait time" but the time that passes before 
> the message is printed. It is the kernel default; it was 120 seconds 
> earlier, but that was changed by Ingo Molnar back in September. I do 
> get a lot less noise, but it really doesn't tell anything about 
> the nature of the problem.

That's true - the detector is really simple and only tries to flag 
suspiciously long uninterruptible waits. It prints out the context 
it finds but otherwise does not try to go deep about exactly why 
that delay happened.

Would you agree that the message is correct, and that there is some 
sort of "tasks wait way too long" problem on your system?

Considering:

> The system's spec:
> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
> RAID10 connected using 4Gbit Fibre Channel. I'll leave it up to you to
> decide if that's fast or slow.
[...]
> Yes, I've hit 120s+ penalties just by saving a file in vim.

i think it's fair to say that an almost 10-minute uninterruptible 
sleep sucks for the user, by any reasonable standard. It is the year 
2009, not 1959.

The delay might be difficult to fix, but it's still reality - and 
that's the purpose of this particular debug helper: to rub reality 
under our noses, whether we like it or not.

( _My_ personal pain threshold for waiting for the computer is 
  around 1 _second_. If any command does something that i cannot
  Ctrl-C or Ctrl-Z my way out of i get annoyed. So the historic 
  limit for the hung tasks check was 10 seconds, then 60 seconds. 
  But people argued that it's too low so it was raised to 120 then 
  480 seconds. If almost 10 minutes of uninterruptible wait is still 
  acceptable then the watchdog can be turned off (because it's 
  basically pointless to run it in that case - no amount of delay 
  will be 'bad'). )

	Ingo

* Re: Linux 2.6.29
  2009-03-24  6:46   ` David Rees
  2009-03-24  7:32     ` Jesper Krogh
@ 2009-03-24  9:15     ` Alan Cox
  2009-03-24  9:32       ` Ingo Molnar
  1 sibling, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-24  9:15 UTC (permalink / raw)
  To: David Rees; +Cc: Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

> posted to the list earlier trying to see if there was anything that
> could be done to help my specific case.  I've got a system where if
> someone starts writing out a large file, it kills client NFS writes.
> Makes the system unusable:
> http://marc.info/?l=linux-kernel&m=123732127919368&w=2

I have not had this problem since I applied Arjan's (for some reason
repeatedly rejected) patch to change the ioprio of the various writeback
daemons. Under some loads changing to the noop I/O scheduler also seems
to help (as do most of the non default ones)
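
To check which I/O scheduler a disk is using before and after such a
change, the kernel's scheduler listing can be parsed. The sketch below
assumes the listing comes from /sys/block/<device>/queue/scheduler and
that the active scheduler is the name in square brackets; the sample
string and the function name parse_scheduler_line are illustrative, not
taken from this thread.

```python
def parse_scheduler_line(line):
    """Split a scheduler listing into (active, available).

    The active scheduler is the name wrapped in square brackets;
    available lists every name with the brackets removed.
    """
    active = None
    available = []
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


# Sample listing (an assumption modelled on kernels of this era).
print(parse_scheduler_line("noop anticipatory deadline [cfq]"))
```

Writing one of the listed names back to the same file is how the
scheduler is switched for that device, which requires root.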

> Everyone seems to agree that "autotuning" it is the way to go.  But no
> one seems willing to step up and try to do it.  Probably because it's
> hard to get right!

If this is a VM problem why does fixing the I/O priority of the various
daemons seem to cure at least some of it ?

Alan

* Re: Linux 2.6.29
  2009-03-24  9:15     ` Alan Cox
@ 2009-03-24  9:32       ` Ingo Molnar
  2009-03-24 10:10         ` Alan Cox
  0 siblings, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24  9:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > posted to the list earlier trying to see if there was anything that
> > could be done to help my specific case.  I've got a system where if
> > someone starts writing out a large file, it kills client NFS writes.
> > Makes the system unusable:
> > http://marc.info/?l=linux-kernel&m=123732127919368&w=2
> 
> I have not had this problem since I applied Arjan's (for some reason
> repeatedly rejected) patch to change the ioprio of the various writeback
> daemons. Under some loads changing to the noop I/O scheduler also seems
> to help (as do most of the non default ones)

(link would be useful)

	Ingo

* Re: Linux 2.6.29
  2009-03-24  9:32       ` Ingo Molnar
@ 2009-03-24 10:10         ` Alan Cox
  2009-03-24 10:31           ` Ingo Molnar
  2009-03-24 12:27           ` Andi Kleen
  0 siblings, 2 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-24 10:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

> > I have not had this problem since I applied Arjan's (for some reason
> > repeatedly rejected) patch to change the ioprio of the various writeback
> > daemons. Under some loads changing to the noop I/O scheduler also seems
> > to help (as do most of the non default ones)
> 
> (link would be useful)


"Give kjournald a IOPRIO_CLASS_RT io priority"

October 2007 (yes its that old)

And do the same as per discussion to the writeback tasks.
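
For readers unfamiliar with the I/O priority encoding, here is a minimal
sketch of how a class and a level are packed into a single ioprio value.
The shift of 13 and the class numbers follow the kernel's
include/linux/ioprio.h; the helper names (ioprio_value() and friends) are
made up for this illustration and are not taken from the patch under
discussion.

```c
/* Class numbers and shift as in the kernel's include/linux/ioprio.h. */
enum io_class {
	IO_CLASS_NONE = 0,	/* no class set explicitly */
	IO_CLASS_RT   = 1,	/* realtime: served ahead of the other classes */
	IO_CLASS_BE   = 2,	/* best-effort: what ordinary tasks get */
	IO_CLASS_IDLE = 3,	/* idle: served only when nothing else wants the disk */
};

#define IO_CLASS_SHIFT 13

/* Pack a class and a level (0 = highest, 7 = lowest) into one value. */
static inline int ioprio_value(enum io_class cls, int level)
{
	return ((int)cls << IO_CLASS_SHIFT) | level;
}

/* Extract the class from a packed value. */
static inline int ioprio_class(int value)
{
	return value >> IO_CLASS_SHIFT;
}

/* Extract the level within the class from a packed value. */
static inline int ioprio_level(int value)
{
	return value & ((1 << IO_CLASS_SHIFT) - 1);
}
```

A packed value of this form is what the ioprio_set(2) system call takes
from userspace, so the same class change can be tried on a running thread
without patching the kernel.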


Which isn't to say there are not also vm problems - look at the I/O
patterns with any kernel after about 2.6.18/19 and there seems to be a
serious problem with writeback from the mm and fs writes falling over
each other and turning the smooth writeout into thrashing back and forth
as both try to write out different bits of the same stuff.

<Rant>
Really someone needs to sit down and actually build a proper model of the
VM behaviour in a tool like netlogo rather than continually keep adding
ever more complex and thus unpredictable hacks to it. That way we might
better understand what is occurring and why.
</Rant>

Alan


* Re: Linux 2.6.29
  2009-03-24 10:10         ` Alan Cox
@ 2009-03-24 10:31           ` Ingo Molnar
  2009-03-24 11:12             ` Andrew Morton
  2009-03-24 13:20             ` Theodore Tso
  2009-03-24 12:27           ` Andi Kleen
  1 sibling, 2 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 10:31 UTC (permalink / raw)
  To: Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, Theodore Tso, Jens Axboe
  Cc: David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > > I have not had this problem since I applied Arjan's (for some reason
> > > repeatedly rejected) patch to change the ioprio of the various writeback
> > > daemons. Under some loads changing to the noop I/O scheduler also seems
> > > to help (as do most of the non default ones)
> > 
> > (link would be useful)
> 
> 
> "Give kjournald a IOPRIO_CLASS_RT io priority"
> 
> October 2007 (yes its that old)

thx. A more recent submission from Arjan would be:

    http://lkml.org/lkml/2008/10/1/405

Resolution was that Tytso indicated it went into some sort of ext4 
patch queue:

| I've ported the patch to the ext4 filesystem, and dropped it into 
| the unstable portion of the ext4 patch queue.
|
|   ext4: akpm's locking hack to fix locking delays

but 6 months down the line and i can find no trace of this upstream 
anywhere.

<let-me-rant-too>

The thing is ... this is a _bad_ ext3 design bug affecting ext3 
users in the last decade or so of ext3 existence. Why is this issue 
not handled with the utmost high priority and why wasnt it fixed 5 
years ago already? :-)

It does not matter whether we have extents or htrees when there are 
_trivially reproducible_ basic usability problems with ext3.

	Ingo

* Re: Linux 2.6.29
  2009-03-24  8:16       ` Ingo Molnar
@ 2009-03-24 11:10         ` Jesper Krogh
  0 siblings, 0 replies; 664+ messages in thread
From: Jesper Krogh @ 2009-03-24 11:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David Rees, Linus Torvalds, Linux Kernel Mailing List

Ingo Molnar wrote:
> * Jesper Krogh <jesper@krogh.cc> wrote:
> 
>> David Rees wrote:
>>> On Mon, Mar 23, 2009 at 11:19 PM, Jesper Krogh <jesper@krogh.cc> wrote:
>>>> I know this has been discussed before:
>>>>
>>>> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480
>>>> seconds.
>>> Ouch - 480 seconds, how much memory is in that machine, and how slow
>>> are the disks? 
>> The 480 secondes is not the "wait time" but the time gone before 
>> the message is printed. It the kernel-default it was earlier 120 
>> seconds but thats changed by Ingo Molnar back in september. I do 
>> get a lot of less noise but it really doesn't tell anything about 
>> the nature of the problem.
> 
> That's true - the detector is really simple and only tries to flag 
> suspiciously long uninterruptible waits. It prints out the context 
> it finds but otherwise does not try to go deep about exactly why 
> that delay happened.
> 
> Would you agree that the message is correct, and that there is some 
> sort of "tasks wait way too long" problem on your system?

The message is absolutely correct (it was even at 120s). That's too long
for what I consider good.

> Considering:
> 
>> The systes spec:
>> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in  
>> Raid10 connected using 4Gbit fibre-channel. I'll let it up to you to  
>> decide if thats fast or slow?
> [...]
>> Yes, I've hit 120s+ penalties just by saving a file in vim.
> 
> i think it's fair to say that an almost 10 minutes uninterruptible 
> sleep sucks to the user, by any reasonable standard. It is the year 
> 2009, not 1959.
> 
> The delay might be difficult to fix, but it's still reality - and 
> that's the purpose of this particular debug helper: to rub reality 
> under our noses, whether we like it or not.
> 
> ( _My_ personal pain threshold for waiting for the computer is 
>   around 1 _second_. If any command does something that i cannot
>   Ctrl-C or Ctrl-Z my way out of i get annoyed. So the historic 
>   limit for the hung tasks check was 10 seconds, then 60 seconds. 
>   But people argued that it's too low so it was raised to 120 then 
>   480 seconds. If almost 10 minutes of uninterruptible wait is still 
>   acceptable then the watchdog can be turned off (because it's 
>   basically pointless to run it in that case - no amount of delay 
>   will be 'bad'). )

That's about the same threshold for me. I could accept it if I 
happened to be doing something really crazy, but this is merely about 
reading some files in and generating indexes out of them. None of the 
files are "huge": < 15GB for the top 3, average < 100MB.
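
To make the detector's rule concrete, a minimal sketch follows. It is a
simplification for illustration only, not the kernel's code: the struct,
its fields and looks_hung() are hypothetical names, and treating a timeout
of 0 as "watchdog off" is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a task as the watchdog sees it. */
enum sketch_state {
	SKETCH_RUNNING,
	SKETCH_INTERRUPTIBLE,	/* sleeping, but a signal can wake it */
	SKETCH_UNINTERRUPTIBLE,	/* sleeping, Ctrl-C will not help */
};

struct task_sample {
	enum sketch_state state;
	unsigned long blocked_secs;	/* time spent in the current state */
};

/*
 * Flag a task only if it has been in uninterruptible sleep for longer
 * than the timeout. Nothing here explains why it is stuck; the check
 * only reports that the wait is suspiciously long.
 */
static inline bool looks_hung(const struct task_sample *t,
			      unsigned long timeout_secs)
{
	if (timeout_secs == 0)		/* assumed: 0 disables the check */
		return false;
	if (t->state != SKETCH_UNINTERRUPTIBLE)
		return false;
	return t->blocked_secs > timeout_secs;
}
```

With this rule a task blocked for 200 seconds is reported at the 120
second threshold but stays silent at 480, which is why raising the
threshold cuts the noise without changing the underlying delay.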

-- 
Jesper

* Re: Linux 2.6.29
  2009-03-24 10:31           ` Ingo Molnar
@ 2009-03-24 11:12             ` Andrew Morton
  2009-03-24 12:23               ` Alan Cox
                                 ` (2 more replies)
  2009-03-24 13:20             ` Theodore Tso
  1 sibling, 3 replies; 664+ messages in thread
From: Andrew Morton @ 2009-03-24 11:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Theodore Tso, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

On Tue, 24 Mar 2009 11:31:11 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > > > I have not had this problem since I applied Arjan's (for some reason
> > > > repeatedly rejected) patch to change the ioprio of the various writeback
> > > > daemons. Under some loads changing to the noop I/O scheduler also seems
> > > > to help (as do most of the non default ones)
> > > 
> > > (link would be useful)
> > 
> > 
> > "Give kjournald a IOPRIO_CLASS_RT io priority"
> > 
> > October 2007 (yes its that old)
> 
> thx. A more recent submission from Arjan would be:
> 
>     http://lkml.org/lkml/2008/10/1/405
> 
> Resolution was that Tytso indicated it went into some sort of ext4 
> patch queue:
> 
> | I've ported the patch to the ext4 filesystem, and dropped it into 
> | the unstable portion of the ext4 patch queue.
> |
> |   ext4: akpm's locking hack to fix locking delays
> 
> but 6 months down the line and i can find no trace of this upstream 
> anywhere.
> 
> <let-me-rant-too>
> 
> The thing is ... this is a _bad_ ext3 design bug affecting ext3 
> users in the last decade or so of ext3 existence. Why is this issue 
> not handled with the utmost high priority and why wasnt it fixed 5 
> years ago already? :-)
> 
> It does not matter whether we have extents or htrees when there are 
> _trivially reproducible_ basic usability problems with ext3.
> 

It's all there in that Oct 2008 thread.

The proposed tweak to kjournald is a bad fix - partly because it will
elevate the priority of vast amounts of IO whose priority we don't _want_
elevated.

But mainly because the problem lies elsewhere - in an area of contention
between the committing and running transactions which we knowingly and
reluctantly added to fix a bug in 

commit 773fc4c63442fbd8237b4805627f6906143204a8
Author:     akpm <akpm>
AuthorDate: Sun May 19 23:23:01 2002 +0000
Commit:     akpm <akpm>
CommitDate: Sun May 19 23:23:01 2002 +0000

    [PATCH] fix ext3 buffer-stealing
    
    Patch from sct fixes a long-standing (I did it!) and rather complex
    problem with ext3.
    
    The problem is to do with buffers which are continually being dirtied
    by an external agent.  I had code in there (for easily-triggerable
    livelock avoidance) which steals the buffer from checkpoint mode and
    reattaches it to the running transaction.  This violates ext3 ordering
    requirements - it can permit journal space to be reclaimed before the
    relevant data has really been written out.
    
    Also, we do have to reliably get a lock on the buffer when moving it
    between lists and inspecting its internal state.  Otherwise a competing
    read from the underlying block device can trigger an assertion failure,
    and a competing write to the underlying block device can confuse ext3
    journalling state completely.
    

Now this:

> Resolution was that Tytso indicated it went into some sort of ext4 
> patch queue:

was not a fix at all.  It was a known-buggy hack which I proposed simply to
remove that contention point to let us find out if we're on the right
track.  IIRC Ric was going to ask someone to do some performance testing of
that hack, but we never heard back.

The bottom line is that someone needs to do some serious rooting through
the very heart of JBD transaction logic and nobody has yet put their hand
up.  If we do that, and it turns out to be just too hard to fix then yes,
perhaps that's the time to start looking at palliative bandaids.

The number of people who can be looked at to do serious ext3/JBD work is
pretty small now.  Ted, Stephen and I got old and died.  Jan does good work
but is spread thinly.


* Re: Linux 2.6.29
  2009-03-24 11:12             ` Andrew Morton
@ 2009-03-24 12:23               ` Alan Cox
  2009-03-24 13:37               ` Theodore Tso
  2009-03-25 12:37               ` Jan Kara
  2 siblings, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-24 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Theodore Tso, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

> The proposed tweak to kjournald is a bad fix - partly because it will
> elevate the priority of vast amounts of IO whose priority we don't _want_
> elevated.

It's a huge improvement in practice because it both fixes the stupid
stalls and smooths out the rest of the I/O traffic. I spend a lot of my
time looking at what the disk driver is getting fed and it's not a good
mix. Even more revealing is the noop scheduler and the fact that it
frequently outperforms all the fancy I/O scheduling we do even on
relatively dumb hardware (as well as showing how mixed up our I/O
patterns currently are).

> But mainly because the problem lies elsewhere - in an area of contention
> between the committing and running transactions which we knowingly and
> reluctantly added to fix a bug in

The problem emerged around 2007, not 2002, so it's not that simple.

> The number of people who can be looked at to do serious ext3/JBD work is
> pretty small now.  Ted, Stephen and I got old and died.  Jan does good work
> but is spread thinly.

Which is all the more reason to use a temporary fix in the meantime so
the OS is usable. I think it's pretty poor that for over a year those in
the know who need a well-performing system have had to apply trivial
out-of-tree patches rejected on the basis that "eventually like maybe
whenever perhaps we'll possibly some day you know consider fixing this,
but don't hold your breath".

There is a second reason to do this: if ext4 is the future then it is far
better to fix this stuff in ext4 properly and leave ext3 clear of
extremely invasive high-risk fixes when a quick bandaid will do just fine
for the remaining lifetime of fs/jbd.

Also note that kjournald is only one of the afflicted threads - the same
is true of the crypto, and of the vm writeback. Also note the other point
about the disk scheduler defaults being terrible for some streaming I/O
patterns; the patch for that is also stuck in bugzilla.

If picking "no-op" speeds up my generic x86 box with random onboard SATA,
we are doing something very non-optimal.

* Re: Linux 2.6.29
  2009-03-24 10:10         ` Alan Cox
  2009-03-24 10:31           ` Ingo Molnar
@ 2009-03-24 12:27           ` Andi Kleen
  1 sibling, 0 replies; 664+ messages in thread
From: Andi Kleen @ 2009-03-24 12:27 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

>> > I have not had this problem since I applied Arjan's (for some reason
>> > repeatedly rejected) patch to change the ioprio of the various writeback
>> > daemons. Under some loads changing to the noop I/O scheduler also seems
>> > to help (as do most of the non default ones)
>> 
>> (link would be useful)
>
>
> "Give kjournald a IOPRIO_CLASS_RT io priority"
>
> October 2007 (yes its that old)

One issue discussed back then (also for a similar XFS patch) was
that having the kernel use the RT priorities by default makes
them useless as a user override.

The proposal was to have a new priority level between normal and RT
for this, but no one implemented it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-23 23:29 Linux 2.6.29 Linus Torvalds
  2009-03-24  6:19 ` Jesper Krogh
@ 2009-03-24 13:02 ` Ingo Molnar
  2009-03-24 13:12     ` Ingo Molnar
                     ` (2 more replies)
  2009-03-27 13:35 ` Linux 2.6.29 Hans-Peter Jansen
  2 siblings, 3 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 13:02 UTC (permalink / raw)
  To: Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra
  Cc: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 4030 bytes --]


Yesterday about half of my testboxes (3 out of 7) started getting 
weird networking failures: their network interface just got stuck 
completely - no rx and no tx at all. Restarting the interface did 
not help.

The failures were highly sporadic and not reproducible - they 
triggered in distcc workloads, and on random kernels and seemingly 
random .config's.

After spending most of today trying to find a good reproducer (my 
regular tests weren't specific enough to catch it in any bisectable 
manner), i settled on 4 parallel instances of TCP traffic:

  nohup ssh testbox yes &
  nohup ssh testbox yes &
  nohup ssh testbox yes &
  nohup ssh testbox yes &

  [ over gigabit, forcedeth driver. ]

If the box hung within 15 minutes, the kernel was deemed bad. Using 
that method i arrived at this upstream networking fix which was 
merged yesterday:

 | 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
 | commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
 | Author: Herbert Xu <herbert@gondor.apana.org.au>
 | Date:   Tue Mar 17 13:11:29 2009 -0700
 |
 |     gro: Fix legacy path napi_complete crash

Applying the straight revert below cured the problem - i now have 10 
million packets and 30 minutes of uptime and the box is still fine.

bisection log:

 [   10 iterations ] good: 73bc6e1: Merge branch 'linus'
 [    3 iterations ]  bad: 4eac7d0: Merge branch 'irq/threaded'
 [ 6.0m packets    ] good: e17bbdb: Merge branch 'tracing/core'
 [ 0.1m packets    ]  bad: 8e0ee43: Linux 2.6.29
 [ 0.1m packets    ]  bad: e2fc4d1: dca: add missing copyright/license headers
 [ 0.2m packets    ]  bad: 4783256: virtio_net: Make virtio_net support carrier detection
 [ 0.4m packets    ]  bad: 4ada810: Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf
 [ 7.0m packets    ] good: ec8d540: netfilter: conntrack: fix dropping packet after l4proto->packet()
 [ 4.0m packets    ] good: d1238d5: netfilter: conntrack: check for NEXTHDR_NONE before header sanity checking
 [ 0.1m packets    ]  bad: 303c6a0: gro: Fix legacy path napi_complete crash

   (the first column is millions of packets tested.)
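
The search behind a log like this is a plain binary search over the
commit range. A minimal sketch, assuming a linear history with one
known-good end and one known-bad end (real git history is a graph, which
git bisect handles); first_bad(), is_bad_fn and bad_from() are
hypothetical names for illustration. In real use the predicate would be
"boot this kernel, run the four ssh streams, did the box hang within 15
minutes?".

```c
#include <stddef.h>

/* Returns nonzero if the kernel built from commit 'index' misbehaves. */
typedef int (*is_bad_fn)(size_t index, void *ctx);

/*
 * 'good' is the index of a commit known to work, 'bad' one known to
 * fail, with good < bad. Each step tests the midpoint and halves the
 * range, so N commits need about log2(N) test runs.
 */
static inline size_t first_bad(size_t good, size_t bad,
			       is_bad_fn is_bad, void *ctx)
{
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (is_bad(mid, ctx))
			bad = mid;	/* culprit is at mid or earlier */
		else
			good = mid;	/* culprit is after mid */
	}
	return bad;
}

/* Stand-in predicate for testing: everything from *ctx onwards is bad. */
static inline int bad_from(size_t index, void *ctx)
{
	return index >= *(const size_t *)ctx;
}
```

The weak point with a sporadic failure is the "good" verdict: a commit
that merely did not hang yet would send the search the wrong way, which
is why the good runs in the log were left to pass millions of packets.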

Looking at this commit also explains the asymmetric test pattern i 
found amongst boxes: all boxes with a new-style NAPI driver (e1000e) 
work - the others (forcedeth, 3c59x/vortex) have stuck interfaces.

I've attached the reproducer (non-SMP) .config. The system has:

00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)

[   34.722154] forcedeth: Reverse Engineered nForce ethernet driver. Version 0.62.
[   34.729406] forcedeth 0000:00:0a.0: setting latency timer to 64
[   34.735320] nv_probe: set workaround bit for reversed mac addr
[   35.265783] PM: Adding info for No Bus:eth0
[   35.270877] forcedeth 0000:00:0a.0: ifname eth0, PHY OUI 0x5043 @ 1, addr 00:13:d4:dc:41:12
[   35.279086] forcedeth 0000:00:0a.0: highdma csum timirq gbit lnktim desc-v3
[   35.286273] initcall init_nic+0x0/0x16 returned 0 after 550966 usecs

( but the bug does not seem to be driver specific - old-style NAPI 
  seems to be enough to trigger it. )

Please let me know if you need more info or if i can help with 
testing a different patch. Bisecting it was hard, but testing 
whether a fix patch does the trick will be a lot easier, as all
the testboxes are back in working order now.

Thanks,

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 net/core/dev.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux2/net/core/dev.c
===================================================================
--- linux2.orig/net/core/dev.c
+++ linux2/net/core/dev.c
@@ -2588,9 +2588,9 @@ static int process_backlog(struct napi_s
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			__napi_complete(napi);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
@@ -2599,7 +2599,6 @@ static int process_backlog(struct napi_s
 
 	napi_gro_flush(napi);
 
-out:
 	return work;
 }
 

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 67719 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.29-rc7
# Tue Mar 24 13:47:49 2009
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
# CONFIG_SWAP is not set
# CONFIG_SYSVIPC is not set
# CONFIG_POSIX_MQUEUE is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
# CONFIG_TASK_DELAY_ACCT is not set
# CONFIG_TASK_XACCT is not set
# CONFIG_AUDIT is not set

#
# RCU Subsystem
#
# CONFIG_CLASSIC_RCU is not set
# CONFIG_TREE_RCU is not set
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_TRACE=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_PREEMPT_RCU_TRACE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=20
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
# CONFIG_FAIR_GROUP_SCHED is not set
CONFIG_RT_GROUP_SCHED=y
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_MM_OWNER=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
# CONFIG_BUG is not set
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
# CONFIG_SIGNALFD is not set
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y
CONFIG_HAVE_OPROFILE=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
# CONFIG_MODULES is not set
CONFIG_BLOCK=y
CONFIG_LBD=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_BLK_DEV_BSG=y
# CONFIG_BLK_DEV_INTEGRITY is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
# CONFIG_IOSCHED_DEADLINE is not set
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_FREEZER=y

#
# Processor type and features
#
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
# CONFIG_SMP is not set
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
# CONFIG_X86_PC is not set
CONFIG_X86_ELAN=y
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_VISWS is not set
CONFIG_X86_RDC321X=y
# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
CONFIG_PARAVIRT_GUEST=y
# CONFIG_VMI is not set
CONFIG_KVM_CLOCK=y
# CONFIG_KVM_GUEST is not set
# CONFIG_LGUEST_GUEST is not set
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_CLOCK=y
CONFIG_PARAVIRT_DEBUG=y
# CONFIG_MEMTEST is not set
CONFIG_X86_CPU=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=4
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_ALIGNMENT_16=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PROCESSOR_SELECT=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_CYRIX_32=y
# CONFIG_CPU_SUP_AMD is not set
CONFIG_CPU_SUP_CENTAUR_32=y
# CONFIG_CPU_SUP_TRANSMETA_32 is not set
CONFIG_CPU_SUP_UMC_32=y
CONFIG_X86_DS=y
CONFIG_X86_PTRACE_BTS=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_IOMMU_HELPER is not set
# CONFIG_IOMMU_API is not set
CONFIG_NR_CPUS=1
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_X86_UP_APIC=y
# CONFIG_X86_UP_IOAPIC is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
# CONFIG_VM86 is not set
# CONFIG_TOSHIBA is not set
CONFIG_I8K=y
CONFIG_X86_REBOOTFIXUPS=y
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
# CONFIG_X86_CPUID is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
# CONFIG_VMSPLIT_3G is not set
# CONFIG_VMSPLIT_3G_OPT is not set
# CONFIG_VMSPLIT_2G is not set
# CONFIG_VMSPLIT_2G_OPT is not set
CONFIG_VMSPLIT_1G=y
CONFIG_PAGE_OFFSET=0x40000000
CONFIG_HIGHMEM=y
# CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_PHYS_ADDR_T_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
CONFIG_MMU_NOTIFIER=y
CONFIG_HIGHPTE=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
# CONFIG_X86_RESERVE_LOW_64K is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_SECCOMP=y
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
# CONFIG_SCHED_HRTICK is not set
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x100000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_VERBOSE=y
CONFIG_CAN_PM_TRACE=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_SUSPEND_FREEZER=y
# CONFIG_ACPI is not set
CONFIG_X86_APM_BOOT=y
CONFIG_APM=y
CONFIG_APM_IGNORE_USER_SUSPEND=y
CONFIG_APM_DO_ENABLE=y
CONFIG_APM_CPU_IDLE=y
CONFIG_APM_DISPLAY_BLANK=y
CONFIG_APM_ALLOW_INTS=y

#
# CPU Frequency scaling
#
# CONFIG_CPU_FREQ is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
CONFIG_PCI_GODIRECT=y
# CONFIG_PCI_GOOLPC is not set
# CONFIG_PCI_GOANY is not set
CONFIG_PCI_DIRECT=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCIEPORTBUS is not set
# CONFIG_ARCH_SUPPORTS_MSI is not set
CONFIG_PCI_LEGACY=y
CONFIG_PCI_DEBUG=y
CONFIG_PCI_STUB=y
CONFIG_ISA_DMA_API=y
# CONFIG_ISA is not set
CONFIG_MCA=y
CONFIG_MCA_LEGACY=y
# CONFIG_MCA_PROC_FS is not set
# CONFIG_SCx200 is not set
# CONFIG_OLPC is not set
# CONFIG_PCCARD is not set
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_HAVE_AOUT=y
CONFIG_BINFMT_AOUT=y
# CONFIG_BINFMT_MISC is not set
CONFIG_HAVE_ATOMIC_IOMAP=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_COMPAT_NET_DEV_OPS=y
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_IPCOMP=y
CONFIG_NET_KEY=y
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_IP_PNP_BOOTP=y
CONFIG_IP_PNP_RARP=y
CONFIG_NET_IPIP=y
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=y
# CONFIG_INET_ESP is not set
CONFIG_INET_IPCOMP=y
CONFIG_INET_XFRM_TUNNEL=y
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
CONFIG_TCP_CONG_HYBLA=y
CONFIG_TCP_CONG_VEGAS=y
CONFIG_TCP_CONG_SCALABLE=y
# CONFIG_TCP_CONG_LP is not set
CONFIG_TCP_CONG_VENO=y
CONFIG_TCP_CONG_YEAH=y
CONFIG_TCP_CONG_ILLINOIS=y
# CONFIG_DEFAULT_BIC is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
# CONFIG_IPV6 is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_DEBUG=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_NETLINK_QUEUE=y
# CONFIG_NETFILTER_NETLINK_LOG is not set
CONFIG_NF_CONNTRACK=y
CONFIG_NF_CT_ACCT=y
# CONFIG_NF_CONNTRACK_MARK is not set
CONFIG_NF_CONNTRACK_SECMARK=y
# CONFIG_NF_CONNTRACK_EVENTS is not set
# CONFIG_NF_CT_PROTO_DCCP is not set
CONFIG_NF_CT_PROTO_GRE=y
CONFIG_NF_CT_PROTO_SCTP=y
CONFIG_NF_CT_PROTO_UDPLITE=y
# CONFIG_NF_CONNTRACK_AMANDA is not set
CONFIG_NF_CONNTRACK_FTP=y
CONFIG_NF_CONNTRACK_H323=y
CONFIG_NF_CONNTRACK_IRC=y
CONFIG_NF_CONNTRACK_NETBIOS_NS=y
CONFIG_NF_CONNTRACK_PPTP=y
CONFIG_NF_CONNTRACK_SANE=y
CONFIG_NF_CONNTRACK_SIP=y
CONFIG_NF_CONNTRACK_TFTP=y
CONFIG_NF_CT_NETLINK=y
CONFIG_NETFILTER_TPROXY=y
CONFIG_NETFILTER_XTABLES=y
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=y
# CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=y
CONFIG_NETFILTER_XT_TARGET_DSCP=y
CONFIG_NETFILTER_XT_TARGET_MARK=y
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_NOTRACK is not set
CONFIG_NETFILTER_XT_TARGET_RATEEST=y
CONFIG_NETFILTER_XT_TARGET_TPROXY=y
# CONFIG_NETFILTER_XT_TARGET_TRACE is not set
CONFIG_NETFILTER_XT_TARGET_SECMARK=y
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=y
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=y
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
CONFIG_NETFILTER_XT_MATCH_DCCP=y
CONFIG_NETFILTER_XT_MATCH_DSCP=y
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=y
# CONFIG_NETFILTER_XT_MATCH_HELPER is not set
CONFIG_NETFILTER_XT_MATCH_IPRANGE=y
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
CONFIG_NETFILTER_XT_MATCH_MAC=y
CONFIG_NETFILTER_XT_MATCH_MARK=y
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
CONFIG_NETFILTER_XT_MATCH_OWNER=y
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=y
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
CONFIG_NETFILTER_XT_MATCH_QUOTA=y
CONFIG_NETFILTER_XT_MATCH_RATEEST=y
CONFIG_NETFILTER_XT_MATCH_REALM=y
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
CONFIG_NETFILTER_XT_MATCH_SCTP=y
CONFIG_NETFILTER_XT_MATCH_SOCKET=y
CONFIG_NETFILTER_XT_MATCH_STATE=y
CONFIG_NETFILTER_XT_MATCH_STATISTIC=y
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
CONFIG_NETFILTER_XT_MATCH_TIME=y
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
CONFIG_IP_VS=y
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
# CONFIG_IP_VS_PROTO_TCP is not set
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y

#
# IPVS scheduler
#
# CONFIG_IP_VS_RR is not set
CONFIG_IP_VS_WRR=y
# CONFIG_IP_VS_LC is not set
CONFIG_IP_VS_WLC=y
CONFIG_IP_VS_LBLC=y
CONFIG_IP_VS_LBLCR=y
CONFIG_IP_VS_DH=y
CONFIG_IP_VS_SH=y
CONFIG_IP_VS_SED=y
CONFIG_IP_VS_NQ=y

#
# IPVS application helper
#

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
# CONFIG_NF_CONNTRACK_IPV4 is not set
CONFIG_IP_NF_QUEUE=y
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_ADDRTYPE is not set
CONFIG_IP_NF_MATCH_AH=y
CONFIG_IP_NF_MATCH_ECN=y
CONFIG_IP_NF_MATCH_TTL=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
# CONFIG_IP_NF_TARGET_LOG is not set
CONFIG_IP_NF_TARGET_ULOG=y
CONFIG_IP_NF_MANGLE=y
# CONFIG_IP_NF_TARGET_ECN is not set
# CONFIG_IP_NF_TARGET_TTL is not set
CONFIG_IP_NF_RAW=y
# CONFIG_IP_NF_SECURITY is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# DECnet: Netfilter Configuration
#
CONFIG_DECNET_NF_GRABULATOR=y
CONFIG_BRIDGE_NF_EBTABLES=y
# CONFIG_BRIDGE_EBT_BROUTE is not set
CONFIG_BRIDGE_EBT_T_FILTER=y
# CONFIG_BRIDGE_EBT_T_NAT is not set
# CONFIG_BRIDGE_EBT_802_3 is not set
CONFIG_BRIDGE_EBT_AMONG=y
# CONFIG_BRIDGE_EBT_ARP is not set
CONFIG_BRIDGE_EBT_IP=y
CONFIG_BRIDGE_EBT_LIMIT=y
CONFIG_BRIDGE_EBT_MARK=y
CONFIG_BRIDGE_EBT_PKTTYPE=y
CONFIG_BRIDGE_EBT_STP=y
CONFIG_BRIDGE_EBT_VLAN=y
CONFIG_BRIDGE_EBT_ARPREPLY=y
CONFIG_BRIDGE_EBT_DNAT=y
CONFIG_BRIDGE_EBT_MARK_T=y
CONFIG_BRIDGE_EBT_REDIRECT=y
CONFIG_BRIDGE_EBT_SNAT=y
CONFIG_BRIDGE_EBT_LOG=y
CONFIG_BRIDGE_EBT_ULOG=y
CONFIG_BRIDGE_EBT_NFLOG=y
CONFIG_IP_DCCP=y
CONFIG_INET_DCCP_DIAG=y

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
CONFIG_IP_DCCP_CCID2_DEBUG=y
# CONFIG_IP_DCCP_CCID3 is not set

#
# DCCP Kernel Hacking
#
CONFIG_IP_DCCP_DEBUG=y
CONFIG_IP_SCTP=y
# CONFIG_SCTP_DBG_MSG is not set
CONFIG_SCTP_DBG_OBJCNT=y
CONFIG_SCTP_HMAC_NONE=y
# CONFIG_SCTP_HMAC_SHA1 is not set
# CONFIG_SCTP_HMAC_MD5 is not set
# CONFIG_TIPC is not set
CONFIG_ATM=y
CONFIG_ATM_CLIP=y
CONFIG_ATM_CLIP_NO_ICMP=y
CONFIG_ATM_LANE=y
CONFIG_ATM_MPOA=y
CONFIG_ATM_BR2684=y
CONFIG_ATM_BR2684_IPFILTER=y
CONFIG_STP=y
CONFIG_BRIDGE=y
CONFIG_NET_DSA=y
CONFIG_NET_DSA_TAG_DSA=y
CONFIG_NET_DSA_TAG_EDSA=y
CONFIG_NET_DSA_TAG_TRAILER=y
CONFIG_NET_DSA_MV88E6XXX=y
CONFIG_NET_DSA_MV88E6060=y
CONFIG_NET_DSA_MV88E6XXX_NEED_PPU=y
CONFIG_NET_DSA_MV88E6131=y
CONFIG_NET_DSA_MV88E6123_61_65=y
# CONFIG_VLAN_8021Q is not set
CONFIG_DECNET=y
CONFIG_DECNET_ROUTER=y
CONFIG_LLC=y
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
CONFIG_X25=y
CONFIG_LAPB=y
# CONFIG_ECONET is not set
CONFIG_WAN_ROUTER=y
# CONFIG_NET_SCHED is not set
CONFIG_NET_CLS_ROUTE=y
# CONFIG_DCB is not set

#
# Network testing
#
CONFIG_NET_PKTGEN=y
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
CONFIG_IRDA=y

#
# IrDA protocols
#
CONFIG_IRLAN=y
CONFIG_IRNET=y
CONFIG_IRCOMM=y
# CONFIG_IRDA_ULTRA is not set

#
# IrDA options
#
# CONFIG_IRDA_CACHE_LAST_LSAP is not set
CONFIG_IRDA_FAST_RR=y
CONFIG_IRDA_DEBUG=y

#
# Infrared-port device drivers
#

#
# SIR device drivers
#
# CONFIG_IRTTY_SIR is not set

#
# Dongle support
#
CONFIG_KINGSUN_DONGLE=y
# CONFIG_KSDAZZLE_DONGLE is not set
# CONFIG_KS959_DONGLE is not set

#
# FIR device drivers
#
# CONFIG_USB_IRDA is not set
# CONFIG_SIGMATEL_FIR is not set
CONFIG_NSC_FIR=y
CONFIG_WINBOND_FIR=y
# CONFIG_TOSHIBA_FIR is not set
CONFIG_SMC_IRCC_FIR=y
CONFIG_ALI_FIR=y
CONFIG_VLSI_FIR=y
CONFIG_VIA_FIR=y
CONFIG_MCS_FIR=y
CONFIG_BT=y
CONFIG_BT_L2CAP=y
CONFIG_BT_SCO=y
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=y
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
# CONFIG_BT_HIDP is not set

#
# Bluetooth device drivers
#
CONFIG_BT_HCIBTUSB=y
CONFIG_BT_HCIBTSDIO=y
CONFIG_BT_HCIUART=y
CONFIG_BT_HCIUART_H4=y
# CONFIG_BT_HCIUART_BCSP is not set
CONFIG_BT_HCIUART_LL=y
CONFIG_BT_HCIBCM203X=y
CONFIG_BT_HCIBPA10X=y
CONFIG_BT_HCIBFUSB=y
# CONFIG_BT_HCIVHCI is not set
# CONFIG_AF_RXRPC is not set
CONFIG_PHONET=y
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
CONFIG_WIRELESS_EXT=y
CONFIG_LIB80211=y
CONFIG_LIB80211_CRYPT_WEP=y
CONFIG_LIB80211_CRYPT_CCMP=y
CONFIG_LIB80211_CRYPT_TKIP=y
CONFIG_WIMAX=y
CONFIG_WIMAX_DEBUG_LEVEL=8
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
CONFIG_PARPORT=y
CONFIG_PARPORT_PC=y
CONFIG_PARPORT_SERIAL=y
CONFIG_PARPORT_PC_FIFO=y
# CONFIG_PARPORT_PC_SUPERIO is not set
# CONFIG_PARPORT_GSC is not set
CONFIG_PARPORT_AX88796=y
# CONFIG_PARPORT_1284 is not set
CONFIG_PARPORT_NOT_PC=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_PARIDE is not set
CONFIG_BLK_CPQ_DA=y
CONFIG_BLK_CPQ_CISS_DA=y
# CONFIG_CISS_SCSI_TAPE is not set
CONFIG_BLK_DEV_DAC960=y
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_CRYPTOLOOP=y
CONFIG_BLK_DEV_NBD=y
CONFIG_BLK_DEV_SX8=y
CONFIG_BLK_DEV_UB=y
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_BLK_DEV_XIP=y
CONFIG_CDROM_PKTCDVD=y
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_VIRTIO_BLK is not set
# CONFIG_BLK_DEV_HD is not set
# CONFIG_MISC_DEVICES is not set
CONFIG_TIFM_CORE=y
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=y
CONFIG_CHR_DEV_OSST=y
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
# CONFIG_CHR_DEV_SG is not set
CONFIG_CHR_DEV_SCH=y

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=y
CONFIG_SCSI_FC_TGT_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=y
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
CONFIG_SCSI_CXGB3_ISCSI=y
CONFIG_BLK_DEV_3W_XXXX_RAID=y
CONFIG_SCSI_3W_9XXX=y
CONFIG_SCSI_ACARD=y
CONFIG_SCSI_AACRAID=y
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=5000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
CONFIG_SCSI_AIC79XX=y
CONFIG_AIC79XX_CMDS_PER_DEVICE=32
CONFIG_AIC79XX_RESET_DELAY_MS=5000
CONFIG_AIC79XX_DEBUG_ENABLE=y
CONFIG_AIC79XX_DEBUG_MASK=0
CONFIG_AIC79XX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
CONFIG_SCSI_ARCMSR=y
# CONFIG_MEGARAID_NEWGEN is not set
CONFIG_MEGARAID_LEGACY=y
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
CONFIG_LIBFC=y
CONFIG_FCOE=y
CONFIG_SCSI_DMX3191D=y
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_FD_MCS is not set
CONFIG_SCSI_GDTH=y
CONFIG_SCSI_IBMMCA=y
CONFIG_IBMMCA_SCSI_ORDER_STANDARD=y
CONFIG_IBMMCA_SCSI_DEV_RESET=y
CONFIG_SCSI_IPS=y
# CONFIG_SCSI_INITIO is not set
CONFIG_SCSI_INIA100=y
# CONFIG_SCSI_PPA is not set
CONFIG_SCSI_IMM=y
CONFIG_SCSI_IZIP_EPP16=y
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
# CONFIG_SCSI_MVSAS is not set
CONFIG_SCSI_NCR_D700=y
CONFIG_SCSI_STEX=y
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
CONFIG_SCSI_NCR_Q720=y
CONFIG_SCSI_NCR53C8XX_DEFAULT_TAGS=8
CONFIG_SCSI_NCR53C8XX_MAX_TAGS=32
CONFIG_SCSI_NCR53C8XX_SYNC=20
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA_FC=y
CONFIG_SCSI_QLA_ISCSI=y
CONFIG_SCSI_LPFC=y
CONFIG_SCSI_LPFC_DEBUG_FS=y
CONFIG_SCSI_SIM710=y
CONFIG_SCSI_DC395x=y
# CONFIG_SCSI_DC390T is not set
CONFIG_SCSI_NSP32=y
# CONFIG_SCSI_DEBUG is not set
CONFIG_SCSI_SRP=y
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=y
# CONFIG_SCSI_DH_HP_SW is not set
CONFIG_SCSI_DH_EMC=y
# CONFIG_SCSI_DH_ALUA is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
CONFIG_SATA_SIL24=y
CONFIG_ATA_SFF=y
CONFIG_SATA_SVW=y
CONFIG_ATA_PIIX=y
CONFIG_SATA_MV=y
CONFIG_SATA_NV=y
CONFIG_PDC_ADMA=y
CONFIG_SATA_QSTOR=y
CONFIG_SATA_PROMISE=y
CONFIG_SATA_SX4=y
CONFIG_SATA_SIL=y
CONFIG_SATA_SIS=y
CONFIG_SATA_ULI=y
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
CONFIG_SATA_INIC162X=y
CONFIG_PATA_ALI=y
CONFIG_PATA_AMD=y
# CONFIG_PATA_ARTOP is not set
CONFIG_PATA_ATIIXP=y
CONFIG_PATA_CMD640_PCI=y
# CONFIG_PATA_CMD64X is not set
CONFIG_PATA_CS5520=y
# CONFIG_PATA_CS5530 is not set
CONFIG_PATA_CS5535=y
CONFIG_PATA_CS5536=y
CONFIG_PATA_CYPRESS=y
CONFIG_PATA_EFAR=y
CONFIG_ATA_GENERIC=y
CONFIG_PATA_HPT366=y
# CONFIG_PATA_HPT37X is not set
CONFIG_PATA_HPT3X2N=y
CONFIG_PATA_HPT3X3=y
CONFIG_PATA_HPT3X3_DMA=y
CONFIG_PATA_IT821X=y
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
CONFIG_PATA_TRIFLEX=y
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
CONFIG_PATA_OLDPIIX=y
# CONFIG_PATA_NETCELL is not set
CONFIG_PATA_NINJA32=y
CONFIG_PATA_NS87410=y
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OPTI=y
CONFIG_PATA_OPTIDMA=y
CONFIG_PATA_PDC_OLD=y
CONFIG_PATA_RADISYS=y
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
CONFIG_PATA_PDC2027X=y
# CONFIG_PATA_SIL680 is not set
CONFIG_PATA_SIS=y
# CONFIG_PATA_VIA is not set
CONFIG_PATA_WINBOND=y
CONFIG_PATA_PLATFORM=y
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
# CONFIG_MD_LINEAR is not set
CONFIG_MD_RAID0=y
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
CONFIG_MD_MULTIPATH=y
CONFIG_MD_FAULTY=y
CONFIG_BLK_DEV_DM=y
CONFIG_DM_DEBUG=y
CONFIG_DM_CRYPT=y
# CONFIG_DM_SNAPSHOT is not set
CONFIG_DM_MIRROR=y
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
CONFIG_DM_DELAY=y
# CONFIG_DM_UEVENT is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
CONFIG_FIREWIRE=y
CONFIG_FIREWIRE_OHCI=y
CONFIG_FIREWIRE_OHCI_DEBUG=y
# CONFIG_FIREWIRE_SBP2 is not set
CONFIG_IEEE1394=y
CONFIG_IEEE1394_OHCI1394=y
CONFIG_IEEE1394_PCILYNX=y
CONFIG_IEEE1394_SBP2=y
CONFIG_IEEE1394_SBP2_PHYS_DMA=y
CONFIG_IEEE1394_ETH1394_ROM_ENTRY=y
CONFIG_IEEE1394_ETH1394=y
CONFIG_IEEE1394_RAWIO=y
CONFIG_IEEE1394_VIDEO1394=y
# CONFIG_IEEE1394_DV1394 is not set
CONFIG_IEEE1394_VERBOSEDEBUG=y
CONFIG_I2O=y
# CONFIG_I2O_LCT_NOTIFY_ON_CHANGES is not set
# CONFIG_I2O_EXT_ADAPTEC is not set
# CONFIG_I2O_CONFIG is not set
# CONFIG_I2O_BUS is not set
# CONFIG_I2O_BLOCK is not set
CONFIG_I2O_SCSI=y
CONFIG_I2O_PROC=y
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_DUMMY=y
CONFIG_BONDING=y
CONFIG_MACVLAN=y
CONFIG_EQUALIZER=y
CONFIG_TUN=y
CONFIG_VETH=y
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=y
# CONFIG_DAVICOM_PHY is not set
CONFIG_QSEMI_PHY=y
CONFIG_LXT_PHY=y
CONFIG_CICADA_PHY=y
# CONFIG_VITESSE_PHY is not set
CONFIG_SMSC_PHY=y
# CONFIG_BROADCOM_PHY is not set
CONFIG_ICPLUS_PHY=y
CONFIG_REALTEK_PHY=y
# CONFIG_NATIONAL_PHY is not set
CONFIG_STE10XP=y
CONFIG_LSI_ET1011C_PHY=y
# CONFIG_FIXED_PHY is not set
CONFIG_MDIO_BITBANG=y
CONFIG_MDIO_GPIO=y
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
CONFIG_HAPPYMEAL=y
CONFIG_SUNGEM=y
CONFIG_CASSINI=y
# CONFIG_NET_VENDOR_3COM is not set
CONFIG_NET_VENDOR_SMC=y
CONFIG_ULTRAMCA=y
CONFIG_ENC28J60=y
# CONFIG_ENC28J60_WRITEVERIFY is not set
# CONFIG_DNET is not set
# CONFIG_NET_TULIP is not set
CONFIG_AT1700=y
# CONFIG_DEPCA is not set
CONFIG_HP100=y
CONFIG_NE2_MCA=y
CONFIG_IBMLANA=y
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
CONFIG_ADAPTEC_STARFIRE=y
# CONFIG_B44 is not set
CONFIG_FORCEDETH=y
# CONFIG_FORCEDETH_NAPI is not set
CONFIG_E100=y
CONFIG_FEALNX=y
# CONFIG_NATSEMI is not set
CONFIG_NE2K_PCI=y
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
CONFIG_8139TOO_PIO=y
CONFIG_8139TOO_TUNE_TWISTER=y
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
CONFIG_R6040=y
CONFIG_SIS900=y
CONFIG_EPIC100=y
CONFIG_SMSC9420=y
CONFIG_SUNDANCE=y
CONFIG_SUNDANCE_MMIO=y
# CONFIG_TLAN is not set
CONFIG_VIA_RHINE=y
# CONFIG_VIA_RHINE_MMIO is not set
# CONFIG_SC92031 is not set
# CONFIG_NET_POCKET is not set
CONFIG_ATL2=y
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
CONFIG_DL2K=y
CONFIG_E1000=y
CONFIG_E1000E=y
CONFIG_IP1000=y
CONFIG_IGB=y
CONFIG_IGB_LRO=y
CONFIG_NS83820=y
CONFIG_HAMACHI=y
CONFIG_YELLOWFIN=y
CONFIG_R8169=y
CONFIG_SIS190=y
# CONFIG_SKGE is not set
CONFIG_SKY2=y
# CONFIG_SKY2_DEBUG is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_TIGON3=y
CONFIG_BNX2=y
CONFIG_QLA3XXX=y
# CONFIG_ATL1 is not set
CONFIG_ATL1E=y
# CONFIG_ATL1C is not set
CONFIG_JME=y
CONFIG_NETDEV_10000=y
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3_DEPENDS=y
CONFIG_CHELSIO_T3=y
# CONFIG_ENIC is not set
CONFIG_IXGBE=y
CONFIG_IXGB=y
# CONFIG_S2IO is not set
CONFIG_MYRI10GE=y
# CONFIG_NETXEN_NIC is not set
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
CONFIG_TEHUTI=y
CONFIG_BNX2X=y
CONFIG_QLGE=y
CONFIG_SFC=y
CONFIG_BE2NET=y
CONFIG_TR=y
CONFIG_IBMTR=y
# CONFIG_IBMOL is not set
CONFIG_IBMLS=y
CONFIG_3C359=y
# CONFIG_TMS380TR is not set
# CONFIG_SMCTR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
CONFIG_WLAN_80211=y
# CONFIG_LIBERTAS is not set
# CONFIG_AIRO is not set
CONFIG_HERMES=y
# CONFIG_HERMES_CACHE_FW_ON_INIT is not set
# CONFIG_PLX_HERMES is not set
CONFIG_TMD_HERMES=y
CONFIG_NORTEL_HERMES=y
# CONFIG_PCI_HERMES is not set
CONFIG_ATMEL=y
# CONFIG_PCI_ATMEL is not set
CONFIG_PRISM54=y
CONFIG_USB_ZD1201=y
# CONFIG_USB_NET_RNDIS_WLAN is not set
CONFIG_IPW2100=y
CONFIG_IPW2100_MONITOR=y
CONFIG_IPW2100_DEBUG=y
# CONFIG_IPW2200 is not set
CONFIG_LIBIPW=y
# CONFIG_LIBIPW_DEBUG is not set
# CONFIG_IWLWIFI_LEDS is not set
# CONFIG_HOSTAP is not set

#
# WiMAX Wireless Broadband devices
#
CONFIG_WIMAX_I2400M=y
# CONFIG_WIMAX_I2400M_USB is not set
CONFIG_WIMAX_I2400M_SDIO=y
CONFIG_WIMAX_I2400M_DEBUG_LEVEL=8

#
# USB Network Adapters
#
CONFIG_USB_CATC=y
# CONFIG_USB_KAWETH is not set
CONFIG_USB_PEGASUS=y
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_WAN is not set
CONFIG_ATM_DRIVERS=y
# CONFIG_ATM_DUMMY is not set
CONFIG_ATM_TCP=y
CONFIG_ATM_LANAI=y
# CONFIG_ATM_ENI is not set
CONFIG_ATM_FIRESTREAM=y
CONFIG_ATM_ZATM=y
CONFIG_ATM_ZATM_DEBUG=y
# CONFIG_ATM_NICSTAR is not set
CONFIG_ATM_IDT77252=y
CONFIG_ATM_IDT77252_DEBUG=y
# CONFIG_ATM_IDT77252_RCV_ALL is not set
CONFIG_ATM_IDT77252_USE_SUNI=y
CONFIG_ATM_AMBASSADOR=y
CONFIG_ATM_AMBASSADOR_DEBUG=y
# CONFIG_ATM_HORIZON is not set
CONFIG_ATM_IA=y
# CONFIG_ATM_IA_DEBUG is not set
# CONFIG_ATM_FORE200E is not set
CONFIG_ATM_HE=y
CONFIG_ATM_HE_USE_SUNI=y
CONFIG_ATM_SOLOS=y
# CONFIG_FDDI is not set
CONFIG_HIPPI=y
# CONFIG_ROADRUNNER is not set
# CONFIG_PLIP is not set
CONFIG_PPP=y
CONFIG_PPP_MULTILINK=y
# CONFIG_PPP_FILTER is not set
CONFIG_PPP_ASYNC=y
CONFIG_PPP_SYNC_TTY=y
CONFIG_PPP_DEFLATE=y
CONFIG_PPP_BSDCOMP=y
CONFIG_PPP_MPPE=y
# CONFIG_PPPOE is not set
CONFIG_PPPOATM=y
CONFIG_PPPOL2TP=y
CONFIG_SLIP=y
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLHC=y
# CONFIG_SLIP_SMART is not set
# CONFIG_SLIP_MODE_SLIP6 is not set
CONFIG_NET_FC=y
CONFIG_NETCONSOLE=y
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_VIRTIO_NET=y
CONFIG_ISDN=y
# CONFIG_MISDN is not set
# CONFIG_ISDN_I4L is not set
# CONFIG_ISDN_CAPI is not set
CONFIG_PHONE=y
# CONFIG_PHONE_IXJ is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=y
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_KEYBOARD_SUNKBD=y
CONFIG_KEYBOARD_LKKBD=y
CONFIG_KEYBOARD_XTKBD=y
CONFIG_KEYBOARD_NEWTON=y
CONFIG_KEYBOARD_STOWAWAY=y
CONFIG_KEYBOARD_GPIO=y
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_PS2_ALPS is not set
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
CONFIG_MOUSE_PS2_TOUCHKIT=y
CONFIG_MOUSE_SERIAL=y
CONFIG_MOUSE_APPLETOUCH=y
CONFIG_MOUSE_BCM5974=y
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_GPIO is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
CONFIG_JOYSTICK_A3D=y
CONFIG_JOYSTICK_ADI=y
CONFIG_JOYSTICK_COBRA=y
CONFIG_JOYSTICK_GF2K=y
CONFIG_JOYSTICK_GRIP=y
CONFIG_JOYSTICK_GRIP_MP=y
CONFIG_JOYSTICK_GUILLEMOT=y
CONFIG_JOYSTICK_INTERACT=y
# CONFIG_JOYSTICK_SIDEWINDER is not set
CONFIG_JOYSTICK_TMDC=y
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
CONFIG_JOYSTICK_STINGER=y
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
CONFIG_JOYSTICK_DB9=y
CONFIG_JOYSTICK_GAMECON=y
CONFIG_JOYSTICK_TURBOGRAFX=y
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
CONFIG_TABLET_USB_AIPTEK=y
# CONFIG_TABLET_USB_GTCO is not set
CONFIG_TABLET_USB_KBTAB=y
CONFIG_TABLET_USB_WACOM=y
# CONFIG_INPUT_TOUCHSCREEN is not set
# CONFIG_INPUT_MISC is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
CONFIG_SERIO_CT82C710=y
CONFIG_SERIO_PARKBD=y
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
CONFIG_GAMEPORT=y
CONFIG_GAMEPORT_NS558=y
CONFIG_GAMEPORT_L4=y
CONFIG_GAMEPORT_EMU10K1=y
CONFIG_GAMEPORT_FM801=y

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
# CONFIG_SERIAL_8250_MANY_PORTS is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
# CONFIG_SERIAL_8250_MCA is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=y
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_PRINTER=y
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=y
CONFIG_HVC_DRIVER=y
# CONFIG_VIRTIO_CONSOLE is not set
CONFIG_IPMI_HANDLER=y
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_PANIC_STRING=y
# CONFIG_IPMI_DEVICE_INTERFACE is not set
CONFIG_IPMI_SI=y
CONFIG_IPMI_WATCHDOG=y
# CONFIG_IPMI_POWEROFF is not set
# CONFIG_HW_RANDOM is not set
# CONFIG_NVRAM is not set
CONFIG_R3964=y
CONFIG_APPLICOM=y
# CONFIG_SONYPI is not set
CONFIG_MWAVE=y
CONFIG_PC8736x_GPIO=y
CONFIG_NSC_GPIO=y
CONFIG_CS5535_GPIO=y
# CONFIG_RAW_DRIVER is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=y
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y
CONFIG_I2C_ALGOPCA=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
CONFIG_I2C_ALI1535=y
CONFIG_I2C_ALI1563=y
CONFIG_I2C_ALI15X3=y
# CONFIG_I2C_AMD756 is not set
CONFIG_I2C_AMD8111=y
CONFIG_I2C_I801=y
CONFIG_I2C_ISCH=y
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
CONFIG_I2C_SIS5595=y
CONFIG_I2C_SIS630=y
CONFIG_I2C_SIS96X=y
CONFIG_I2C_VIA=y
CONFIG_I2C_VIAPRO=y

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
CONFIG_I2C_GPIO=y
CONFIG_I2C_OCORES=y
CONFIG_I2C_SIMTEC=y

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_PARPORT is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
CONFIG_I2C_TAOS_EVM=y
CONFIG_I2C_TINY_USB=y

#
# Graphics adapter I2C/DDC channel drivers
#
# CONFIG_I2C_VOODOO3 is not set

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_PCA_PLATFORM=y
# CONFIG_SCx200_ACB is not set

#
# Miscellaneous I2C Chip support
#
CONFIG_DS1682=y
CONFIG_SENSORS_PCF8574=y
CONFIG_PCF8575=y
CONFIG_SENSORS_PCF8591=y
CONFIG_SENSORS_MAX6875=y
# CONFIG_SENSORS_TSL2550 is not set
CONFIG_I2C_DEBUG_CORE=y
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
CONFIG_SPI=y
CONFIG_SPI_DEBUG=y
CONFIG_SPI_MASTER=y

#
# SPI Master Controller Drivers
#
CONFIG_SPI_BITBANG=y
CONFIG_SPI_BUTTERFLY=y
# CONFIG_SPI_GPIO is not set
CONFIG_SPI_LM70_LLP=y

#
# SPI Protocol Masters
#
CONFIG_SPI_SPIDEV=y
CONFIG_SPI_TLE62X0=y
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
CONFIG_DEBUG_GPIO=y
# CONFIG_GPIO_SYSFS is not set

#
# Memory mapped GPIO expanders:
#

#
# I2C GPIO expanders:
#
CONFIG_GPIO_MAX732X=y
CONFIG_GPIO_PCA953X=y
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_TWL4030 is not set

#
# PCI GPIO expanders:
#
CONFIG_GPIO_BT8XX=y

#
# SPI GPIO expanders:
#
CONFIG_GPIO_MAX7301=y
CONFIG_GPIO_MCP23S08=y
CONFIG_W1=y
CONFIG_W1_CON=y

#
# 1-wire Bus Masters
#
# CONFIG_W1_MASTER_MATROX is not set
# CONFIG_W1_MASTER_DS2490 is not set
# CONFIG_W1_MASTER_DS2482 is not set
# CONFIG_W1_MASTER_GPIO is not set

#
# 1-wire Slaves
#
# CONFIG_W1_SLAVE_THERM is not set
# CONFIG_W1_SLAVE_SMEM is not set
# CONFIG_W1_SLAVE_DS2431 is not set
CONFIG_W1_SLAVE_DS2433=y
CONFIG_W1_SLAVE_DS2433_CRC=y
CONFIG_W1_SLAVE_DS2760=y
CONFIG_W1_SLAVE_BQ27000=y
# CONFIG_POWER_SUPPLY is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=y
CONFIG_SENSORS_ABITUGURU=y
CONFIG_SENSORS_ABITUGURU3=y
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
CONFIG_SENSORS_ADCXX=y
CONFIG_SENSORS_ADM1021=y
CONFIG_SENSORS_ADM1025=y
CONFIG_SENSORS_ADM1026=y
CONFIG_SENSORS_ADM1029=y
CONFIG_SENSORS_ADM1031=y
CONFIG_SENSORS_ADM9240=y
CONFIG_SENSORS_ADT7462=y
# CONFIG_SENSORS_ADT7470 is not set
CONFIG_SENSORS_ADT7473=y
CONFIG_SENSORS_ADT7475=y
CONFIG_SENSORS_K8TEMP=y
CONFIG_SENSORS_ASB100=y
# CONFIG_SENSORS_ATXP1 is not set
CONFIG_SENSORS_DS1621=y
CONFIG_SENSORS_I5K_AMB=y
CONFIG_SENSORS_F71805F=y
# CONFIG_SENSORS_F71882FG is not set
CONFIG_SENSORS_F75375S=y
CONFIG_SENSORS_FSCHER=y
CONFIG_SENSORS_FSCPOS=y
CONFIG_SENSORS_FSCHMD=y
CONFIG_SENSORS_GL518SM=y
CONFIG_SENSORS_GL520SM=y
# CONFIG_SENSORS_CORETEMP is not set
CONFIG_SENSORS_IBMAEM=y
CONFIG_SENSORS_IBMPEX=y
CONFIG_SENSORS_IT87=y
# CONFIG_SENSORS_LM63 is not set
CONFIG_SENSORS_LM70=y
CONFIG_SENSORS_LM75=y
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
CONFIG_SENSORS_LM80=y
# CONFIG_SENSORS_LM83 is not set
CONFIG_SENSORS_LM85=y
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4245 is not set
CONFIG_SENSORS_MAX1111=y
# CONFIG_SENSORS_MAX1619 is not set
CONFIG_SENSORS_MAX6650=y
# CONFIG_SENSORS_PC87360 is not set
CONFIG_SENSORS_PC87427=y
CONFIG_SENSORS_SIS5595=y
CONFIG_SENSORS_DME1737=y
CONFIG_SENSORS_SMSC47M1=y
CONFIG_SENSORS_SMSC47M192=y
CONFIG_SENSORS_SMSC47B397=y
CONFIG_SENSORS_ADS7828=y
CONFIG_SENSORS_THMC50=y
# CONFIG_SENSORS_VIA686A is not set
CONFIG_SENSORS_VT1211=y
CONFIG_SENSORS_VT8231=y
CONFIG_SENSORS_W83781D=y
CONFIG_SENSORS_W83791D=y
# CONFIG_SENSORS_W83792D is not set
CONFIG_SENSORS_W83793=y
CONFIG_SENSORS_W83L785TS=y
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
CONFIG_SENSORS_W83627EHF=y
CONFIG_SENSORS_HDAPS=y
CONFIG_SENSORS_APPLESMC=y
CONFIG_HWMON_DEBUG_CHIP=y
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_NOWAYOUT=y

#
# Watchdog Device Drivers
#
CONFIG_SOFT_WATCHDOG=y
CONFIG_ACQUIRE_WDT=y
CONFIG_ADVANTECH_WDT=y
CONFIG_ALIM1535_WDT=y
CONFIG_ALIM7101_WDT=y
# CONFIG_SC520_WDT is not set
# CONFIG_EUROTECH_WDT is not set
CONFIG_IB700_WDT=y
CONFIG_IBMASR=y
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
CONFIG_ITCO_WDT=y
CONFIG_ITCO_VENDOR_SUPPORT=y
# CONFIG_IT8712F_WDT is not set
CONFIG_IT87_WDT=y
CONFIG_HP_WATCHDOG=y
CONFIG_SC1200_WDT=y
# CONFIG_PC87413_WDT is not set
CONFIG_RDC321X_WDT=y
CONFIG_60XX_WDT=y
CONFIG_SBC8360_WDT=y
# CONFIG_SBC7240_WDT is not set
CONFIG_CPU5_WDT=y
# CONFIG_SMSC_SCH311X_WDT is not set
CONFIG_SMSC37B787_WDT=y
CONFIG_W83627HF_WDT=y
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
CONFIG_W83877F_WDT=y
CONFIG_W83977F_WDT=y
CONFIG_MACHZ_WDT=y
CONFIG_SBC_EPX_C3_WATCHDOG=y

#
# PCI-based Watchdog Cards
#
CONFIG_PCIPCWATCHDOG=y
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
CONFIG_USBPCWATCHDOG=y
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=y
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
CONFIG_SSB_SILENT=y
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
CONFIG_MFD_SM501=y
# CONFIG_MFD_SM501_GPIO is not set
# CONFIG_HTC_PASIC3 is not set
CONFIG_TPS65010=y
CONFIG_TWL4030_CORE=y
# CONFIG_MFD_TMIO is not set
CONFIG_PMIC_DA903X=y
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM8350_I2C is not set
CONFIG_MFD_PCF50633=y
CONFIG_PCF50633_ADC=y
CONFIG_PCF50633_GPIO=y
CONFIG_REGULATOR=y
CONFIG_REGULATOR_DEBUG=y
# CONFIG_REGULATOR_FIXED_VOLTAGE is not set
# CONFIG_REGULATOR_VIRTUAL_CONSUMER is not set
CONFIG_REGULATOR_BQ24022=y
CONFIG_REGULATOR_DA903X=y
CONFIG_REGULATOR_PCF50633=y

#
# Multimedia devices
#

#
# Multimedia core support
#
CONFIG_VIDEO_DEV=y
CONFIG_VIDEO_V4L2_COMMON=y
# CONFIG_VIDEO_ALLOW_V4L1 is not set
# CONFIG_VIDEO_V4L1_COMPAT is not set
CONFIG_DVB_CORE=y
CONFIG_VIDEO_MEDIA=y

#
# Multimedia drivers
#
CONFIG_VIDEO_SAA7146=y
CONFIG_VIDEO_SAA7146_VV=y
CONFIG_MEDIA_TUNER=y
CONFIG_MEDIA_TUNER_CUSTOMIZE=y
CONFIG_MEDIA_TUNER_SIMPLE=y
CONFIG_MEDIA_TUNER_TDA8290=y
CONFIG_MEDIA_TUNER_TDA827X=y
CONFIG_MEDIA_TUNER_TDA18271=y
CONFIG_MEDIA_TUNER_TDA9887=y
# CONFIG_MEDIA_TUNER_TEA5761 is not set
CONFIG_MEDIA_TUNER_TEA5767=y
# CONFIG_MEDIA_TUNER_MT20XX is not set
CONFIG_MEDIA_TUNER_MT2060=y
CONFIG_MEDIA_TUNER_MT2266=y
CONFIG_MEDIA_TUNER_MT2131=y
CONFIG_MEDIA_TUNER_QT1010=y
# CONFIG_MEDIA_TUNER_XC2028 is not set
CONFIG_MEDIA_TUNER_XC5000=y
# CONFIG_MEDIA_TUNER_MXL5005S is not set
CONFIG_MEDIA_TUNER_MXL5007T=y
CONFIG_VIDEO_V4L2=y
CONFIG_VIDEOBUF_GEN=y
CONFIG_VIDEOBUF_DMA_SG=y
CONFIG_VIDEOBUF_VMALLOC=y
CONFIG_VIDEO_IR=y
CONFIG_VIDEO_TVEEPROM=y
CONFIG_VIDEO_TUNER=y
CONFIG_VIDEO_CAPTURE_DRIVERS=y
CONFIG_VIDEO_ADV_DEBUG=y
CONFIG_VIDEO_FIXED_MINOR_RANGES=y
CONFIG_VIDEO_HELPER_CHIPS_AUTO=y
CONFIG_VIDEO_IR_I2C=y
CONFIG_VIDEO_MSP3400=y
CONFIG_VIDEO_CS5345=y
CONFIG_VIDEO_CS53L32A=y
CONFIG_VIDEO_M52790=y
CONFIG_VIDEO_WM8775=y
CONFIG_VIDEO_WM8739=y
CONFIG_VIDEO_VP27SMPX=y
CONFIG_VIDEO_SAA711X=y
CONFIG_VIDEO_SAA717X=y
CONFIG_VIDEO_TVP5150=y
CONFIG_VIDEO_CX25840=y
CONFIG_VIDEO_CX2341X=y
CONFIG_VIDEO_SAA7127=y
CONFIG_VIDEO_UPD64031A=y
CONFIG_VIDEO_UPD64083=y
# CONFIG_VIDEO_VIVI is not set
# CONFIG_VIDEO_BT848 is not set
# CONFIG_VIDEO_SAA5246A is not set
# CONFIG_VIDEO_SAA5249 is not set
CONFIG_VIDEO_SAA7134=y
# CONFIG_VIDEO_SAA7134_DVB is not set
CONFIG_VIDEO_HEXIUM_ORION=y
# CONFIG_VIDEO_HEXIUM_GEMINI is not set
# CONFIG_VIDEO_CX88 is not set
# CONFIG_VIDEO_CX23885 is not set
CONFIG_VIDEO_AU0828=y
CONFIG_VIDEO_IVTV=y
CONFIG_VIDEO_FB_IVTV=y
CONFIG_VIDEO_CX18=y
# CONFIG_VIDEO_CAFE_CCIC is not set
# CONFIG_SOC_CAMERA is not set
CONFIG_V4L_USB_DRIVERS=y
# CONFIG_USB_VIDEO_CLASS is not set
# CONFIG_USB_GSPCA is not set
CONFIG_VIDEO_PVRUSB2=y
# CONFIG_VIDEO_PVRUSB2_SYSFS is not set
# CONFIG_VIDEO_PVRUSB2_DVB is not set
CONFIG_VIDEO_EM28XX=y
# CONFIG_VIDEO_EM28XX_DVB is not set
CONFIG_VIDEO_USBVISION=y
# CONFIG_USB_ET61X251 is not set
CONFIG_USB_SN9C102=y
# CONFIG_USB_ZC0301 is not set
CONFIG_USB_ZR364XX=y
CONFIG_USB_STKWEBCAM=y
# CONFIG_USB_S2255 is not set
# CONFIG_RADIO_ADAPTERS is not set
# CONFIG_DVB_DYNAMIC_MINORS is not set
CONFIG_DVB_CAPTURE_DRIVERS=y

#
# Supported SAA7146 based PCI Adapters
#
# CONFIG_TTPCI_EEPROM is not set
# CONFIG_DVB_AV7110 is not set
# CONFIG_DVB_BUDGET_CORE is not set

#
# Supported USB Adapters
#
CONFIG_DVB_USB=y
CONFIG_DVB_USB_DEBUG=y
# CONFIG_DVB_USB_A800 is not set
CONFIG_DVB_USB_DIBUSB_MB=y
# CONFIG_DVB_USB_DIBUSB_MB_FAULTY is not set
# CONFIG_DVB_USB_DIBUSB_MC is not set
# CONFIG_DVB_USB_DIB0700 is not set
CONFIG_DVB_USB_UMT_010=y
CONFIG_DVB_USB_CXUSB=y
# CONFIG_DVB_USB_M920X is not set
CONFIG_DVB_USB_GL861=y
# CONFIG_DVB_USB_AU6610 is not set
# CONFIG_DVB_USB_DIGITV is not set
CONFIG_DVB_USB_VP7045=y
CONFIG_DVB_USB_VP702X=y
# CONFIG_DVB_USB_GP8PSK is not set
# CONFIG_DVB_USB_NOVA_T_USB2 is not set
CONFIG_DVB_USB_TTUSB2=y
CONFIG_DVB_USB_DTT200U=y
CONFIG_DVB_USB_OPERA1=y
# CONFIG_DVB_USB_AF9005 is not set
# CONFIG_DVB_USB_DW2102 is not set
CONFIG_DVB_USB_CINERGY_T2=y
# CONFIG_DVB_USB_ANYSEE is not set
CONFIG_DVB_USB_DTV5100=y
CONFIG_DVB_USB_AF9015=y
# CONFIG_DVB_TTUSB_BUDGET is not set
CONFIG_DVB_TTUSB_DEC=y
CONFIG_DVB_SIANO_SMS1XXX=y
# CONFIG_DVB_SIANO_SMS1XXX_SMS_IDS is not set

#
# Supported FlexCopII (B2C2) Adapters
#
CONFIG_DVB_B2C2_FLEXCOP=y
CONFIG_DVB_B2C2_FLEXCOP_PCI=y
# CONFIG_DVB_B2C2_FLEXCOP_USB is not set
CONFIG_DVB_B2C2_FLEXCOP_DEBUG=y

#
# Supported BT878 Adapters
#

#
# Supported Pluto2 Adapters
#
CONFIG_DVB_PLUTO2=y

#
# Supported SDMC DM1105 Adapters
#
# CONFIG_DVB_DM1105 is not set

#
# Supported FireWire (IEEE 1394) Adapters
#
# CONFIG_DVB_FIREDTV is not set

#
# Supported DVB Frontends
#

#
# Customise DVB Frontends
#
CONFIG_DVB_FE_CUSTOMISE=y

#
# Multistandard (satellite) frontends
#
# CONFIG_DVB_STB0899 is not set
# CONFIG_DVB_STB6100 is not set

#
# DVB-S (satellite) frontends
#
CONFIG_DVB_CX24110=y
# CONFIG_DVB_CX24123 is not set
# CONFIG_DVB_MT312 is not set
# CONFIG_DVB_S5H1420 is not set
CONFIG_DVB_STV0288=y
CONFIG_DVB_STB6000=y
CONFIG_DVB_STV0299=y
# CONFIG_DVB_TDA8083 is not set
# CONFIG_DVB_TDA10086 is not set
CONFIG_DVB_TDA8261=y
# CONFIG_DVB_VES1X93 is not set
CONFIG_DVB_TUNER_ITD1000=y
CONFIG_DVB_TUNER_CX24113=y
# CONFIG_DVB_TDA826X is not set
CONFIG_DVB_TUA6100=y
CONFIG_DVB_CX24116=y
CONFIG_DVB_SI21XX=y

#
# DVB-T (terrestrial) frontends
#
# CONFIG_DVB_SP8870 is not set
CONFIG_DVB_SP887X=y
CONFIG_DVB_CX22700=y
# CONFIG_DVB_CX22702 is not set
# CONFIG_DVB_DRX397XD is not set
CONFIG_DVB_L64781=y
CONFIG_DVB_TDA1004X=y
# CONFIG_DVB_NXT6000 is not set
# CONFIG_DVB_MT352 is not set
CONFIG_DVB_ZL10353=y
CONFIG_DVB_DIB3000MB=y
CONFIG_DVB_DIB3000MC=y
CONFIG_DVB_DIB7000M=y
CONFIG_DVB_DIB7000P=y
CONFIG_DVB_TDA10048=y

#
# DVB-C (cable) frontends
#
CONFIG_DVB_VES1820=y
# CONFIG_DVB_TDA10021 is not set
CONFIG_DVB_TDA10023=y
CONFIG_DVB_STV0297=y

#
# ATSC (North American/Korean Terrestrial/Cable DTV) frontends
#
# CONFIG_DVB_NXT200X is not set
# CONFIG_DVB_OR51211 is not set
# CONFIG_DVB_OR51132 is not set
# CONFIG_DVB_BCM3510 is not set
CONFIG_DVB_LGDT330X=y
CONFIG_DVB_LGDT3304=y
CONFIG_DVB_S5H1409=y
CONFIG_DVB_AU8522=y
CONFIG_DVB_S5H1411=y

#
# ISDB-T (terrestrial) frontends
#
# CONFIG_DVB_S921 is not set

#
# Digital terrestrial only tuners/PLL
#
CONFIG_DVB_PLL=y
# CONFIG_DVB_TUNER_DIB0070 is not set

#
# SEC control devices for DVB-S
#
# CONFIG_DVB_LNBP21 is not set
CONFIG_DVB_ISL6405=y
CONFIG_DVB_ISL6421=y
CONFIG_DVB_LGS8GL5=y

#
# Tools to develop new frontends
#
CONFIG_DVB_DUMMY_FE=y
CONFIG_DVB_AF9013=y
CONFIG_DAB=y
CONFIG_USB_DABUSB=y

#
# Graphics support
#
# CONFIG_AGP is not set
CONFIG_DRM=y
CONFIG_DRM_TDFX=y
CONFIG_DRM_R128=y
# CONFIG_DRM_RADEON is not set
CONFIG_DRM_MGA=y
CONFIG_DRM_VIA=y
CONFIG_DRM_SAVAGE=y
CONFIG_VGASTATE=y
CONFIG_VIDEO_OUTPUT_CONTROL=y
CONFIG_FB=y
CONFIG_FIRMWARE_EDID=y
CONFIG_FB_DDC=y
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
CONFIG_FB_FOREIGN_ENDIAN=y
# CONFIG_FB_BOTH_ENDIAN is not set
CONFIG_FB_BIG_ENDIAN=y
# CONFIG_FB_LITTLE_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_HECUBA=y
CONFIG_FB_SVGALIB=y
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
CONFIG_FB_PM2=y
# CONFIG_FB_PM2_FIFO_DISCONNECT is not set
CONFIG_FB_CYBER2000=y
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
CONFIG_FB_IMSTT=y
# CONFIG_FB_VGA16 is not set
CONFIG_FB_UVESA=y
# CONFIG_FB_VESA is not set
CONFIG_FB_N411=y
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
CONFIG_FB_RIVA=y
CONFIG_FB_RIVA_I2C=y
CONFIG_FB_RIVA_DEBUG=y
CONFIG_FB_RIVA_BACKLIGHT=y
CONFIG_FB_LE80578=y
# CONFIG_FB_CARILLO_RANCH is not set
CONFIG_FB_MATROX=y
CONFIG_FB_MATROX_MILLENIUM=y
CONFIG_FB_MATROX_MYSTIQUE=y
CONFIG_FB_MATROX_G=y
CONFIG_FB_MATROX_I2C=y
CONFIG_FB_MATROX_MAVEN=y
CONFIG_FB_MATROX_MULTIHEAD=y
# CONFIG_FB_RADEON is not set
CONFIG_FB_ATY128=y
CONFIG_FB_ATY128_BACKLIGHT=y
# CONFIG_FB_ATY is not set
CONFIG_FB_S3=y
CONFIG_FB_SAVAGE=y
CONFIG_FB_SAVAGE_I2C=y
CONFIG_FB_SAVAGE_ACCEL=y
CONFIG_FB_SIS=y
CONFIG_FB_SIS_300=y
CONFIG_FB_SIS_315=y
CONFIG_FB_VIA=y
CONFIG_FB_NEOMAGIC=y
CONFIG_FB_KYRO=y
CONFIG_FB_3DFX=y
# CONFIG_FB_3DFX_ACCEL is not set
CONFIG_FB_VOODOO1=y
CONFIG_FB_VT8623=y
# CONFIG_FB_CYBLA is not set
CONFIG_FB_TRIDENT=y
# CONFIG_FB_TRIDENT_ACCEL is not set
CONFIG_FB_ARK=y
CONFIG_FB_PM3=y
CONFIG_FB_CARMINE=y
# CONFIG_FB_CARMINE_DRAM_EVAL is not set
CONFIG_CARMINE_DRAM_CUSTOM=y
CONFIG_FB_GEODE=y
CONFIG_FB_GEODE_LX=y
CONFIG_FB_GEODE_GX=y
CONFIG_FB_GEODE_GX1=y
CONFIG_FB_SM501=y
# CONFIG_FB_VIRTUAL is not set
CONFIG_FB_METRONOME=y
CONFIG_FB_MB862XX=y
# CONFIG_FB_MB862XX_PCI_GDC is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=y
CONFIG_LCD_LTV350QV=y
CONFIG_LCD_ILI9320=y
# CONFIG_LCD_TDO24M is not set
CONFIG_LCD_VGG2432A4=y
CONFIG_LCD_PLATFORM=y
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
CONFIG_BACKLIGHT_PROGEAR=y
CONFIG_BACKLIGHT_CARILLO_RANCH=y
CONFIG_BACKLIGHT_DA903X=y
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
CONFIG_DISPLAY_SUPPORT=y

#
# Display hardware drivers
#

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_DUMMY_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE is not set
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
CONFIG_LOGO_LINUX_MONO=y
# CONFIG_LOGO_LINUX_VGA16 is not set
# CONFIG_LOGO_LINUX_CLUT224 is not set
CONFIG_SOUND=y
# CONFIG_SOUND_OSS_CORE is not set
# CONFIG_SND is not set
# CONFIG_SOUND_PRIME is not set
# CONFIG_HID_SUPPORT is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
CONFIG_USB_DEBUG=y
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
# CONFIG_USB_DEVICEFS is not set
# CONFIG_USB_DEVICE_CLASS is not set
CONFIG_USB_DYNAMIC_MINORS=y
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
CONFIG_USB_OTG_WHITELIST=y
CONFIG_USB_OTG_BLACKLIST_HUB=y
# CONFIG_USB_MON is not set
CONFIG_USB_WUSB=y
CONFIG_USB_WUSB_CBAF=y
# CONFIG_USB_WUSB_CBAF_DEBUG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_C67X00_HCD=y
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_OXU210HP_HCD=y
CONFIG_USB_ISP116X_HCD=y
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_HCD_SSB is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
CONFIG_USB_U132_HCD=y
# CONFIG_USB_SL811_HCD is not set
CONFIG_USB_R8A66597_HCD=y
# CONFIG_USB_WHCI_HCD is not set
CONFIG_USB_HWA_HCD=y

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=y
# CONFIG_USB_WDM is not set
CONFIG_USB_TMC=y

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed;
#

#
# see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=y
CONFIG_USB_STORAGE_DEBUG=y
# CONFIG_USB_STORAGE_DATAFAB is not set
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
# CONFIG_USB_STORAGE_SDDR55 is not set
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
# CONFIG_USB_STORAGE_ONETOUCH is not set
CONFIG_USB_STORAGE_KARMA=y
CONFIG_USB_STORAGE_CYPRESS_ATACB=y
# CONFIG_USB_LIBUSUAL is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
CONFIG_USB_USS720=y
CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_EZUSB=y
# CONFIG_USB_SERIAL_GENERIC is not set
CONFIG_USB_SERIAL_AIRCABLE=y
CONFIG_USB_SERIAL_ARK3116=y
CONFIG_USB_SERIAL_BELKIN=y
CONFIG_USB_SERIAL_CH341=y
CONFIG_USB_SERIAL_WHITEHEAT=y
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=y
CONFIG_USB_SERIAL_CP2101=y
CONFIG_USB_SERIAL_CYPRESS_M8=y
CONFIG_USB_SERIAL_EMPEG=y
CONFIG_USB_SERIAL_FTDI_SIO=y
CONFIG_USB_SERIAL_FUNSOFT=y
CONFIG_USB_SERIAL_VISOR=y
CONFIG_USB_SERIAL_IPAQ=y
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
CONFIG_USB_SERIAL_EDGEPORT_TI=y
CONFIG_USB_SERIAL_GARMIN=y
# CONFIG_USB_SERIAL_IPW is not set
CONFIG_USB_SERIAL_IUU=y
CONFIG_USB_SERIAL_KEYSPAN_PDA=y
CONFIG_USB_SERIAL_KEYSPAN=y
CONFIG_USB_SERIAL_KEYSPAN_MPR=y
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
# CONFIG_USB_SERIAL_KEYSPAN_USA28X is not set
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
# CONFIG_USB_SERIAL_KEYSPAN_USA18X is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA19W is not set
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
# CONFIG_USB_SERIAL_KLSI is not set
CONFIG_USB_SERIAL_KOBIL_SCT=y
CONFIG_USB_SERIAL_MCT_U232=y
# CONFIG_USB_SERIAL_MOS7720 is not set
CONFIG_USB_SERIAL_MOS7840=y
CONFIG_USB_SERIAL_MOTOROLA=y
CONFIG_USB_SERIAL_NAVMAN=y
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
CONFIG_USB_SERIAL_SPCP8X5=y
CONFIG_USB_SERIAL_HP4X=y
CONFIG_USB_SERIAL_SAFE=y
CONFIG_USB_SERIAL_SAFE_PADDED=y
# CONFIG_USB_SERIAL_SIEMENS_MPI is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
CONFIG_USB_SERIAL_TI=y
CONFIG_USB_SERIAL_CYBERJACK=y
# CONFIG_USB_SERIAL_XIRCOM is not set
CONFIG_USB_SERIAL_OPTION=y
# CONFIG_USB_SERIAL_OMNINET is not set
CONFIG_USB_SERIAL_OPTICON=y
CONFIG_USB_SERIAL_DEBUG=y

#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=y
CONFIG_USB_EMI26=y
CONFIG_USB_ADUTUX=y
# CONFIG_USB_SEVSEG is not set
CONFIG_USB_RIO500=y
CONFIG_USB_LEGOTOWER=y
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
CONFIG_USB_CYPRESS_CY7C63=y
# CONFIG_USB_CYTHERM is not set
CONFIG_USB_PHIDGET=y
CONFIG_USB_PHIDGETKIT=y
CONFIG_USB_PHIDGETMOTORCONTROL=y
CONFIG_USB_PHIDGETSERVO=y
CONFIG_USB_IDMOUSE=y
CONFIG_USB_FTDI_ELAN=y
CONFIG_USB_APPLEDISPLAY=y
CONFIG_USB_SISUSBVGA=y
CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=y
CONFIG_USB_TRANCEVIBRATOR=y
CONFIG_USB_IOWARRIOR=y
CONFIG_USB_ISIGHTFW=y
# CONFIG_USB_VST is not set
# CONFIG_USB_ATM is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
CONFIG_USB_OTG_UTILS=y
CONFIG_USB_GPIO_VBUS=y
CONFIG_TWL4030_USB=y
CONFIG_UWB=y
CONFIG_UWB_HWA=y
# CONFIG_UWB_WHCI is not set
CONFIG_UWB_WLP=y
# CONFIG_UWB_I1480U is not set
CONFIG_MMC=y
CONFIG_MMC_DEBUG=y
CONFIG_MMC_UNSAFE_RESUME=y

#
# MMC/SD/SDIO Card Drivers
#
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_BOUNCE=y
CONFIG_SDIO_UART=y
CONFIG_MMC_TEST=y

#
# MMC/SD/SDIO Host Controller Drivers
#
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PCI=y
# CONFIG_MMC_RICOH_MMC is not set
# CONFIG_MMC_WBSD is not set
CONFIG_MMC_TIFM_SD=y
CONFIG_MEMSTICK=y
# CONFIG_MEMSTICK_DEBUG is not set

#
# MemoryStick drivers
#
CONFIG_MEMSTICK_UNSAFE_RESUME=y
CONFIG_MSPRO_BLOCK=y

#
# MemoryStick Host Controller Drivers
#
# CONFIG_MEMSTICK_TIFM_MS is not set
CONFIG_MEMSTICK_JMICRON_38X=y
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_ALIX2 is not set
CONFIG_LEDS_PCA9532=y
CONFIG_LEDS_GPIO=y
CONFIG_LEDS_CLEVO_MAIL=y
CONFIG_LEDS_PCA955X=y
CONFIG_LEDS_DA903X=y

#
# LED Triggers
#
# CONFIG_LEDS_TRIGGERS is not set
CONFIG_ACCESSIBILITY=y
CONFIG_A11Y_BRAILLE_CONSOLE=y
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_MM_EDAC=y
# CONFIG_EDAC_AMD76X is not set
# CONFIG_EDAC_E7XXX is not set
CONFIG_EDAC_E752X=y
CONFIG_EDAC_I82875P=y
# CONFIG_EDAC_I82975X is not set
CONFIG_EDAC_I3000=y
CONFIG_EDAC_X38=y
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I82860 is not set
CONFIG_EDAC_R82600=y
CONFIG_EDAC_I5000=y
CONFIG_EDAC_I5100=y
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
CONFIG_RTC_DEBUG=y

#
# RTC interfaces
#
# CONFIG_RTC_INTF_SYSFS is not set
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
CONFIG_RTC_INTF_DEV_UIE_EMUL=y
CONFIG_RTC_DRV_TEST=y

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=y
CONFIG_RTC_DRV_DS1374=y
# CONFIG_RTC_DRV_DS1672 is not set
CONFIG_RTC_DRV_MAX6900=y
CONFIG_RTC_DRV_RS5C372=y
# CONFIG_RTC_DRV_ISL1208 is not set
CONFIG_RTC_DRV_X1205=y
CONFIG_RTC_DRV_PCF8563=y
CONFIG_RTC_DRV_PCF8583=y
CONFIG_RTC_DRV_M41T80=y
CONFIG_RTC_DRV_M41T80_WDT=y
CONFIG_RTC_DRV_TWL4030=y
CONFIG_RTC_DRV_S35390A=y
CONFIG_RTC_DRV_FM3130=y
CONFIG_RTC_DRV_RX8581=y

#
# SPI RTC drivers
#
# CONFIG_RTC_DRV_M41T94 is not set
CONFIG_RTC_DRV_DS1305=y
CONFIG_RTC_DRV_DS1390=y
CONFIG_RTC_DRV_MAX6902=y
CONFIG_RTC_DRV_R9701=y
CONFIG_RTC_DRV_RS5C348=y
CONFIG_RTC_DRV_DS3234=y

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
CONFIG_RTC_DRV_DS1286=y
CONFIG_RTC_DRV_DS1511=y
CONFIG_RTC_DRV_DS1553=y
CONFIG_RTC_DRV_DS1742=y
CONFIG_RTC_DRV_STK17TA8=y
CONFIG_RTC_DRV_M48T86=y
CONFIG_RTC_DRV_M48T35=y
CONFIG_RTC_DRV_M48T59=y
CONFIG_RTC_DRV_BQ4802=y
CONFIG_RTC_DRV_V3020=y
CONFIG_RTC_DRV_PCF50633=y

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
CONFIG_UIO=y
CONFIG_UIO_CIF=y
# CONFIG_UIO_PDRV is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
# CONFIG_UIO_SMX is not set
CONFIG_UIO_SERCOS3=y
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y

#
# Firmware Drivers
#
CONFIG_EDD=y
CONFIG_EDD_OFF=y
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DELL_RBU=y
CONFIG_DCDBAS=y
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
# CONFIG_EXT4DEV_COMPAT is not set
# CONFIG_EXT4_FS_XATTR is not set
CONFIG_FS_XIP=y
CONFIG_JBD=y
CONFIG_JBD_DEBUG=y
CONFIG_JBD2=y
CONFIG_JBD2_DEBUG=y
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
CONFIG_REISERFS_CHECK=y
CONFIG_REISERFS_PROC_INFO=y
# CONFIG_REISERFS_FS_XATTR is not set
CONFIG_JFS_FS=y
# CONFIG_JFS_POSIX_ACL is not set
CONFIG_JFS_SECURITY=y
CONFIG_JFS_DEBUG=y
# CONFIG_JFS_STATISTICS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_DEBUG is not set
# CONFIG_GFS2_FS is not set
CONFIG_OCFS2_FS=y
CONFIG_OCFS2_FS_O2CB=y
# CONFIG_OCFS2_FS_USERSPACE_CLUSTER is not set
# CONFIG_OCFS2_FS_STATS is not set
CONFIG_OCFS2_DEBUG_MASKLOG=y
CONFIG_OCFS2_DEBUG_FS=y
CONFIG_OCFS2_FS_POSIX_ACL=y
CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
# CONFIG_INOTIFY_USER is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
CONFIG_QUOTA_TREE=y
CONFIG_QFMT_V1=y
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=y
# CONFIG_AUTOFS4_FS is not set
# CONFIG_FUSE_FS is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
# CONFIG_NTFS_DEBUG is not set
# CONFIG_NTFS_RW is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
# CONFIG_PROC_SYSCTL is not set
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_MISC_FILESYSTEMS=y
CONFIG_ADFS_FS=y
# CONFIG_ADFS_FS_RW is not set
CONFIG_AFFS_FS=y
CONFIG_HFS_FS=y
CONFIG_HFSPLUS_FS=y
CONFIG_BEFS_FS=y
# CONFIG_BEFS_DEBUG is not set
CONFIG_BFS_FS=y
CONFIG_EFS_FS=y
CONFIG_CRAMFS=y
# CONFIG_SQUASHFS is not set
CONFIG_VXFS_FS=y
CONFIG_MINIX_FS=y
CONFIG_OMFS_FS=y
# CONFIG_HPFS_FS is not set
CONFIG_QNX4FS_FS=y
CONFIG_ROMFS_FS=y
CONFIG_SYSV_FS=y
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
# CONFIG_NFSD_V4 is not set
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_SUNRPC_REGISTER_V4=y
# CONFIG_RPCSEC_GSS_KRB5 is not set
CONFIG_RPCSEC_GSS_SPKM3=y
# CONFIG_SMB_FS is not set
CONFIG_CIFS=y
CONFIG_CIFS_STATS=y
# CONFIG_CIFS_STATS2 is not set
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CIFS_DEBUG2=y
# CONFIG_CIFS_EXPERIMENTAL is not set
CONFIG_NCP_FS=y
# CONFIG_NCPFS_PACKET_SIGNING is not set
CONFIG_NCPFS_IOCTL_LOCKING=y
# CONFIG_NCPFS_STRONG is not set
CONFIG_NCPFS_NFS_NS=y
CONFIG_NCPFS_OS2_NS=y
CONFIG_NCPFS_SMALLDOS=y
CONFIG_NCPFS_NLS=y
CONFIG_NCPFS_EXTRAS=y
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
CONFIG_ACORN_PARTITION_CUMANA=y
# CONFIG_ACORN_PARTITION_EESOX is not set
CONFIG_ACORN_PARTITION_ICS=y
CONFIG_ACORN_PARTITION_ADFS=y
CONFIG_ACORN_PARTITION_POWERTEC=y
CONFIG_ACORN_PARTITION_RISCIX=y
# CONFIG_OSF_PARTITION is not set
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
CONFIG_LDM_DEBUG=y
CONFIG_SGI_PARTITION=y
CONFIG_ULTRIX_PARTITION=y
# CONFIG_SUN_PARTITION is not set
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
CONFIG_SYSV68_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
CONFIG_NLS_CODEPAGE_775=y
# CONFIG_NLS_CODEPAGE_850 is not set
CONFIG_NLS_CODEPAGE_852=y
CONFIG_NLS_CODEPAGE_855=y
CONFIG_NLS_CODEPAGE_857=y
CONFIG_NLS_CODEPAGE_860=y
# CONFIG_NLS_CODEPAGE_861 is not set
CONFIG_NLS_CODEPAGE_862=y
# CONFIG_NLS_CODEPAGE_863 is not set
CONFIG_NLS_CODEPAGE_864=y
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
CONFIG_NLS_CODEPAGE_869=y
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
CONFIG_NLS_CODEPAGE_932=y
CONFIG_NLS_CODEPAGE_949=y
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
CONFIG_NLS_CODEPAGE_1250=y
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
CONFIG_NLS_ISO8859_5=y
# CONFIG_NLS_ISO8859_6 is not set
CONFIG_NLS_ISO8859_7=y
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
CONFIG_NLS_ISO8859_15=y
CONFIG_NLS_KOI8_R=y
CONFIG_NLS_KOI8_U=y
CONFIG_NLS_UTF8=y
CONFIG_DLM=y
CONFIG_DLM_DEBUG=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=1024
CONFIG_MAGIC_SYSRQ=y
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_SELFTEST=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_PI_LIST=y
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
# CONFIG_DEBUG_INFO is not set
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_WRITECOUNT=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_SG=y
CONFIG_DEBUG_NOTIFIERS=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
CONFIG_BOOT_PRINTK_DELAY=y
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_BACKTRACE_SELF_TEST=y
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
CONFIG_FAULT_INJECTION=y
CONFIG_FAILSLAB=y
# CONFIG_FAIL_PAGE_ALLOC is not set
# CONFIG_FAIL_MAKE_REQUEST is not set
CONFIG_FAIL_IO_TIMEOUT=y
# CONFIG_FAULT_INJECTION_DEBUG_FS is not set
CONFIG_LATENCYTOP=y
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_HW_BRANCH_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_TRACING=y

#
# Tracers
#
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
# CONFIG_PREEMPT_TRACER is not set
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_BOOT_TRACER=y
# CONFIG_TRACE_BRANCH_PROFILING is not set
CONFIG_POWER_TRACER=y
# CONFIG_STACK_TRACER is not set
CONFIG_HW_BRANCH_TRACER=y
# CONFIG_DYNAMIC_FTRACE is not set
CONFIG_MMIOTRACE=y
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_FIREWIRE_OHCI_REMOTE_DMA is not set
CONFIG_DYNAMIC_PRINTK_DEBUG=y
CONFIG_SAMPLES=y
# CONFIG_SAMPLE_KOBJECT is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_STRICT_DEVMEM=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_STACK_USAGE is not set
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_X86_PTDUMP=y
CONFIG_DEBUG_RODATA=y
CONFIG_DEBUG_RODATA_TEST=y
CONFIG_4KSTACKS=y
# CONFIG_DOUBLEFAULT is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
# CONFIG_IO_DELAY_0X80 is not set
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
CONFIG_IO_DELAY_NONE=y
CONFIG_DEFAULT_IO_DELAY_TYPE=3
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y

#
# Security options
#
# CONFIG_KEYS is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_NETWORK_XFRM is not set
CONFIG_SECURITY_PATH=y
CONFIG_SECURITY_FILE_CAPABILITIES=y
# CONFIG_SECURITY_ROOTPLUG is not set
CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0
# CONFIG_SECURITY_SMACK is not set
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_CRYPTD=y
# CONFIG_CRYPTO_AUTHENC is not set

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=y
# CONFIG_CRYPTO_GCM is not set
CONFIG_CRYPTO_SEQIV=y

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
CONFIG_CRYPTO_PCBC=y
CONFIG_CRYPTO_XTS=y

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=y
CONFIG_CRYPTO_MD4=y
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=y
CONFIG_CRYPTO_RMD128=y
CONFIG_CRYPTO_RMD160=y
CONFIG_CRYPTO_RMD256=y
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA256 is not set
CONFIG_CRYPTO_SHA512=y
CONFIG_CRYPTO_TGR192=y
CONFIG_CRYPTO_WP512=y

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_586=y
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=y
CONFIG_CRYPTO_BLOWFISH=y
# CONFIG_CRYPTO_CAMELLIA is not set
CONFIG_CRYPTO_CAST5=y
CONFIG_CRYPTO_CAST6=y
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_FCRYPT is not set
CONFIG_CRYPTO_KHAZAD=y
# CONFIG_CRYPTO_SALSA20 is not set
CONFIG_CRYPTO_SALSA20_586=y
CONFIG_CRYPTO_SEED=y
CONFIG_CRYPTO_SERPENT=y
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_TWOFISH=y
CONFIG_CRYPTO_TWOFISH_COMMON=y
CONFIG_CRYPTO_TWOFISH_586=y

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_PADLOCK=y
CONFIG_CRYPTO_DEV_PADLOCK_AES=y
# CONFIG_CRYPTO_DEV_PADLOCK_SHA is not set
CONFIG_CRYPTO_DEV_GEODE=y
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
CONFIG_KVM_AMD=y
CONFIG_KVM_TRACE=y
CONFIG_LGUEST=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_RING=y
CONFIG_VIRTIO_PCI=y
# CONFIG_VIRTIO_BALLOON is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 13:02 ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Ingo Molnar
@ 2009-03-24 13:12     ` Ingo Molnar
  2009-03-24 13:35   ` Herbert Xu
  2009-03-24 14:33   ` Robert Schwebel
  2 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 13:12 UTC (permalink / raw)
  To: Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra
  Cc: Linux Kernel Mailing List, netdev


(netdev Cc:-ed)

* Ingo Molnar <mingo@elte.hu> wrote:

> Yesterday about half of my testboxes (3 out of 7) started getting 
> weird networking failures: their network interface just got stuck 
> completely - no rx and no tx at all. Restarting the interface did 
> not help.

> I've attached the reproducer (non-SMP) .config. The system has:

Note, the .config is randconfig derived. There was a stage of the 
tests when about every ~5-10th randconfig was failing, so I don't 
think it's a rare config combo that triggers this. (But there were 
other stages where 30 randconfigs in a row went fine, so it's hard to 
tell.)

In the worst case the hang needed 2 million packets to trigger - 
that's why I set the limit in the tests to 6 million packets.

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 10:31           ` Ingo Molnar
  2009-03-24 11:12             ` Andrew Morton
@ 2009-03-24 13:20             ` Theodore Tso
  2009-03-24 13:30               ` Ingo Molnar
                                 ` (4 more replies)
  1 sibling, 5 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-24 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 11:31:11AM +0100, Ingo Molnar wrote:
> > 
> > "Give kjournald a IOPRIO_CLASS_RT io priority"
> > 
> > October 2007 (yes its that old)
> 
> thx. A more recent submission from Arjan would be:
> 
>     http://lkml.org/lkml/2008/10/1/405
> 
> Resolution was that Tytso indicated it went into some sort of ext4 
> patch queue:
> 
> | I've ported the patch to the ext4 filesystem, and dropped it into 
> | the unstable portion of the ext4 patch queue.
> |
> |   ext4: akpm's locking hack to fix locking delays
> 
> but 6 months down the line and i can find no trace of this upstream 
> anywhere.

Andrew really didn't like Arjan's patch because it forces
non-synchronous writes to have a real-time I/O priority.  He suggested
an alternative approach which I coded up as "akpm's locking hack to
fix locking delays"; unfortunately, it doesn't work.

In ext4, I quietly put in a mount option, journal_ioprio, and set the
default to be slightly higher than the default I/O priority (but not a
real-time class priority) to prevent the write starvation problem.
This definitely helps for some workloads (when some task is reading
enough to starve out the writes).  

More recently (as in this past weekend), I went back to the ext3
problem, and found a better solution, here:

	 http://lkml.org/lkml/2009/3/21/304
	 http://lkml.org/lkml/2009/3/21/302
	 http://lkml.org/lkml/2009/3/21/303

These patches cause the synchronous writes caused by an fsync() to be
submitted using WRITE_SYNC, instead of WRITE, which definitely helps
in the case where there is a heavy read workload in the background.

They don't solve the problem where there is a *huge* amount of writes
going on, though --- if something is dirtying pages at a rate far
greater than the local disk can write it out, say, either "dd
if=/dev/zero of=/mnt/make-lots-of-writes" or a massive distcc cluster
driving a huge amount of data towards a single system or a wget over a
local 100 megabit ethernet from a massive NFS server where everything
is in cache, then you can have a major delay with the fsync().

What I've found, though, is that if you're just doing a local
copy from one hard drive to another, or downloading a huge iso file
from an ftp server over a wide area network, the fsync() delays really
don't get *that* bad, even with ext3.  At least, I haven't found a
workload that doesn't involve either dd if=/dev/zero or a massive
amount of data coming in over the network that will cause fsync()
delays in the > 1-2 second category.  Ext3 has been around for a long
time, and it's only been the last couple of years that people have
really complained about this; my theory is that it was the rise of >
10 megabit ethernets and the use of systems like distcc that made
this problem really become visible.  The only realistic workload
I've found that triggers this requires a fast network dumping data to
a local filesystem.

(I'm sure someone will be ingenious enough to find something else
though, and if they're interested, I've attached an fsync latency
tester to this note.  If you find something, let me know; I'd be
interested.)

> <let-me-rant-too>
> 
> The thing is ... this is a _bad_ ext3 design bug affecting ext3 
> users in the last decade or so of ext3 existence. Why is this issue 
> not handled with the utmost high priority and why wasnt it fixed 5 
> years ago already? :-)

OK, so there are a couple of solutions to this problem.  One is to use
ext4 and delayed allocation.  This solves the problem by simply not
allocating the blocks in the first place, so we don't have to force
them out to solve the security problem that data=ordered was trying to
solve.  Simply mounting an ext3 filesystem using ext4, without making
any change to the filesystem format, should solve the problem.

Another is to use the mount option data=writeback.  The whole reason
for forcing the writes out to disk was simply to prevent a security
problem that occurs if your system crashes before the data blocks get
forced out to disk.  This could expose previously written data, which
could belong to another user, and might be his e-mail or p0rn.
Historically, this was always a problem with the BSD Fast Filesystem;
it sync'ed out data every 30 seconds, and metadata every 5 seconds.
(This is where the default ext3 commit interval of 5 seconds, and the
default /proc/sys/vm/dirty_expire_centisecs came from.)  After a
system crash, it was possible for files written just before the crash
to point to blocks that had not yet been written, and which contain
some other user's data files.  This was the reason for Stephen Tweedie
implementing the data=ordered mode, and making it the default.

However, these days, nearly all Linux boxes are single user machines,
so the security concern is much less of a problem.  So maybe the best
solution for now is to make data=writeback the default.  This solves
the problem too.  The only problem with this is that there are a lot
of sloppy application writers out there, and they've gotten lazy about
using fsync() where it's necessary; combine that with Ubuntu shipping
massively unstable video drivers that crash if you breathe on the
system wrong (or exit World of Goo :-), and you've got the problem
which was recently slashdotted, and which I wrote about here:

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
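
To make "using fsync() where it's necessary" concrete, here is a
minimal sketch of the usual replace-via-rename pattern, assuming plain
POSIX calls.  The helper name replace_file_durably() and the ".tmp"
suffix are hypothetical.  The intent is that after a crash the file
holds either the old or the new contents rather than zero bytes.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `data` by writing a temporary
 * file, fsync()ing it, and renaming it over the original.
 * Returns 0 on success, -1 on error.
 */
static int replace_file_durably(const char *path, const char *data, size_t len)
{
	char tmp[4096];
	size_t off = 0;
	int fd;

	/* Hypothetical naming scheme: "<path>.tmp" next to the target. */
	if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int) sizeof(tmp))
		return -1;

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	while (off < len) {
		ssize_t n = write(fd, data + off, len - off);
		if (n < 0)
			goto fail;
		off += (size_t) n;
	}
	/* Force the new data to disk *before* the rename exposes it. */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp);
		return -1;
	}
	/* rename() atomically swaps the new file into place. */
	if (rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp);
	return -1;
}
```

A stricter version would also fsync() the containing directory after
the rename; that step is left out here to keep the sketch short.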

> It does not matter whether we have extents or htrees when there are 
> _trivially reproducible_ basic usability problems with ext3.

Try ext4, I think you'll like it.  :-)

Failing that, data=writeback for single-user machines is probably your
best bet.

						- Ted

/*
 * fsync-tester.c
 *
 * Written by Theodore Ts'o, 3/21/09.
 *
 * This file may be redistributed under the terms of the GNU Public
 * License, version 2.
 */

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>	/* gettimeofday(), struct timeval */
#include <time.h>
#include <fcntl.h>
#include <string.h>

#define SIZE (32768*32)

/* Return tv1 - tv2 in seconds. */
static float timeval_subtract(struct timeval *tv1, struct timeval *tv2)
{
	return ((tv1->tv_sec - tv2->tv_sec) +
		((float) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
}

int main(int argc, char **argv)
{
	int	fd;
	struct timeval tv, tv2;
	static char buf[SIZE];	/* 1MB; static keeps it off the stack */

	/* O_CREAT requires an explicit mode argument */
	fd = open("fsync-tester.tst-file", O_WRONLY|O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	memset(buf, 'a', SIZE);
	while (1) {
		/* Rewrite the same 1MB, then time how long fsync() takes */
		if (pwrite(fd, buf, SIZE, 0) < 0) {
			perror("pwrite");
			exit(1);
		}
		gettimeofday(&tv, NULL);
		fsync(fd);
		gettimeofday(&tv2, NULL);
		printf("fsync time: %5.4f\n", timeval_subtract(&tv2, &tv));
		sleep(1);
	}
}



* Re: Linux 2.6.29
  2009-03-24 13:20             ` Theodore Tso
@ 2009-03-24 13:30               ` Ingo Molnar
  2009-03-24 13:51                 ` Theodore Tso
  2009-03-24 13:52               ` Alan Cox
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 13:30 UTC (permalink / raw)
  To: Theodore Tso, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List


* Theodore Tso <tytso@mit.edu> wrote:

> More recently (as in this past weekend), I went back to the ext3 
> problem, and found a better solution, here:
> 
> 	 http://lkml.org/lkml/2009/3/21/304
> 	 http://lkml.org/lkml/2009/3/21/302
> 	 http://lkml.org/lkml/2009/3/21/303
> 
> These patches cause the synchronous writes caused by an fsync() to 
> be submitted using WRITE_SYNC, instead of WRITE, which definitely 
> helps in the case where there is a heavy read workload in the 
> background.
> 
> They don't solve the problem where there is a *huge* amount of 
> writes going on, though --- if something is dirtying pages at a 
> rate far greater than the local disk can write it out, say, either 
> "dd if=/dev/zero of=/mnt/make-lots-of-writes" or a massive distcc 
> cluster driving a huge amount of data towards a single system or a 
> wget over a local 100 megabit ethernet from a massive NFS server 
> where everything is in cache, then you can have a major delay with 
> the fsync().

Nice, thanks for the update! The situation isn't nearly as bleak as i 
feared it was :)

> However, what I've found, though, is that if you're just doing a 
> local copy from one hard drive to another, or downloading a huge 
> iso file from an ftp server over a wide area network, the fsync() 
> delays really don't get *that* bad, even with ext3.  At least, I 
> haven't found a workload that doesn't involve either dd 
> if=/dev/zero or a massive amount of data coming in over the 
> network that will cause fsync() delays in the > 1-2 second 
> category.  Ext3 has been around for a long time, and it's only 
> been the last couple of years that people have really complained 
> about this; my theory is that it was the rise of > 10 megabit 
> ethernets and the use of systems like distcc that really made this 
> problem really become visible.  The only realistic workload I've 
> found that triggers this requires a fast network dumping data to a 
> local filesystem.

i think the problem became visible via the rise in memory size, 
combined with the non-improvement of the performance of rotational 
disks.

The disk speed versus RAM size ratio has become dramatically worse - 
and our "5% of RAM" dirty ratio on a 32 GB box is 1.6 GB - which 
takes an eternity to write out if you happen to sync on that. When 
we had 1 GB of RAM 5% meant 51 MB - one or two seconds to flush out 
- and worse than that, chances are that it's spread out widely on 
the disk, the whole thing becoming seek-limited as well.
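
To make that arithmetic concrete, here is a small sketch that computes
the dirty threshold for a given amount of RAM and a best-case time to
flush it.  The function names are hypothetical, and the 40 MB/s write
rate used in the note below is purely an assumed figure for a
rotational disk, not a number taken from this thread.

```c
/* Dirty-page threshold in MB for a given RAM size and dirty ratio. */
static long dirty_threshold_mb(long ram_mb, long dirty_percent)
{
	return ram_mb * dirty_percent / 100;
}

/*
 * Best-case seconds needed to write `mb` megabytes at
 * `disk_mb_per_sec`, assuming purely sequential I/O.  Seeking
 * only makes the real number worse.
 */
static double flush_seconds(long mb, double disk_mb_per_sec)
{
	return (double) mb / disk_mb_per_sec;
}
```

With 1 GB of RAM and a 5% ratio, dirty_threshold_mb(1024, 5) gives 51
MB, which is a bit over one second at the assumed 40 MB/s.  With 32 GB,
dirty_threshold_mb(32 * 1024, 5) gives 1638 MB, or roughly 41 seconds
at the same rate, before any seeking is taken into account.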

That's where the main difference in perception of this problem comes 
from i believe. The problem was always there, but only in the last 
1-2 years did 4G/8G systems become common enough for people to 
notice.

SSDs will save us eventually, but they will take up to a decade to 
trickle through for us to forget about this problem altogether.

	Ingo


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 13:02 ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Ingo Molnar
  2009-03-24 13:12     ` Ingo Molnar
@ 2009-03-24 13:35   ` Herbert Xu
  2009-03-24 14:06     ` Ingo Molnar
  2009-03-24 14:33   ` Robert Schwebel
  2 siblings, 1 reply; 664+ messages in thread
From: Herbert Xu @ 2009-03-24 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner,
	Peter Zijlstra, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> 
> Yesterday about half of my testboxes (3 out of 7) started getting 
> weird networking failures: their network interface just got stuck 
> completely - no rx and no tx at all. Restarting the interface did 
> not help.

Darn, does this patch help?

net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete.  Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

While this is fishy in itself, let's make the obvious fix right
now of reverting to the previous state where we always called
__napi_complete.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..523f53e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2580,24 +2580,26 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	unsigned long start_time = jiffies;
+	struct sk_buff *skb;
 
 	napi->weight = weight_p;
 	do {
-		struct sk_buff *skb;
-
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
-		if (!skb) {
-			local_irq_enable();
-			napi_complete(napi);
-			goto out;
-		}
 		local_irq_enable();
+		if (!skb)
+			break;
 
 		napi_gro_receive(napi, skb);
 	} while (++work < quota && jiffies == start_time);
 
 	napi_gro_flush(napi);
+	if (skb)
+		goto out;
+
+	local_irq_disable();
+	__napi_complete(napi);
+	local_irq_enable();
 
 out:
 	return work;

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Linux 2.6.29
  2009-03-24 11:12             ` Andrew Morton
  2009-03-24 12:23               ` Alan Cox
@ 2009-03-24 13:37               ` Theodore Tso
  2009-03-25 12:37               ` Jan Kara
  2 siblings, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-24 13:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 04:12:49AM -0700, Andrew Morton wrote:
> But mainly because the problem lies elsewhere - in an area of contention
> between the committing and running transactions which we knowingly and
> reluctantly added to fix a bug in "[PATCH] fix ext3 buffer-stealing"

Well, let's be clear here.  The contention between the committing and
running transactions is an issue, but even if we solved it, that
wouldn't solve the issue of fsync() taking a long time in ext3's
data=ordered mode in the case of massive write starvation caused by a
read-heavy workload, or when a vast number of dirty buffers are
associated with an inode which is about to be committed and a process
triggers an fsync().  So fixing this issue wouldn't have solved the
problem which Ingo complained about (which was an editor calling
fsync() leading to a long delay when saving a file during or right
after a distcc-accelerated kernel compile) or the infamous Firefox 3.0
bug.

Fixing this contention *would* fix the problem where a normal process
which is doing normal file I/O could end up getting stalled
unnecessarily, but that's not what most people are complaining about
--- and shortening the amount of time that it takes to do a commit
(either with ext4's delayed allocation or ext3's data=writeback mount
option) would also address this problem.  That doesn't mean that it's
not worth it to fix this particular contention, but there are multiple
issues going on here.

(Basically we're here:

http://www.kernel.org/pub/linux/kernel/people/paulmck/Confessions/FOSSElephant.html

... in Paul McKenney's version of the parable of the blind men and the elephant:

http://www.kernel.org/pub/linux/kernel/people/paulmck/Confessions/

:-)

> Now this:
> 
> > Resolution was that Tytso indicated it went into some sort of ext4 
> > patch queue:
> 
> was not a fix at all.  It was a known-buggy hack which I proposed simply to
> remove that contention point to let us find out if we're on the right
> track.  IIRC Ric was going to ask someone to do some performance testing of
> that hack, but we never heard back.

Ric did do some preliminary performance testing, and it wasn't
encouraging.  It's still in the unstable portion of the ext4 patch
queue, and it's in my "wish I had more time to look at it; I don't get
to work on ext3/4 full-time" queue.

> The bottom line is that someone needs to do some serious rooting through
> the very heart of JBD transaction logic and nobody has yet put their hand
> up.  If we do that, and it turns out to be just too hard to fix then yes,
> perhaps that's the time to start looking at palliative bandaids.

I disagree that they are _just_ palliative bandaids, because you need
these in order to make sure fsync() completes in a reasonable time, so
that people like Ingo don't get cranky.  :-)   Fixing the contention
between the running and committing transaction is a good thing, and I
hope someone puts up their hand or I magically get the time I need to
really dive into the jbd layer, but it won't help the Firefox 3.0
problem or Ingo's problem with saving files during a distcc run.

						- Ted



* Re: Linux 2.6.29
  2009-03-24 13:30               ` Ingo Molnar
@ 2009-03-24 13:51                 ` Theodore Tso
  2009-03-24 16:34                   ` Jesper Krogh
  2009-03-24 18:20                   ` Mark Lord
  0 siblings, 2 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-24 13:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 02:30:11PM +0100, Ingo Molnar wrote:
> i think the problem became visible via the rise in memory size, 
> combined with the non-improvement of the performance of rotational 
> disks.
> 
> The disk speed versus RAM size ratio has become dramatically worse - 
> and our "5% of RAM" dirty ratio on a 32 GB box is 1.6 GB - which 
> takes an eternity to write out if you happen to sync on that. When 
> we had 1 GB of RAM 5% meant 51 MB - one or two seconds to flush out 
> - and worse than that, chances are that it's spread out widely on 
> the disk, the whole thing becoming seek-limited as well.

That's definitely a problem too, but keep in mind that by default the
journal gets committed every 5 seconds, so the data gets flushed out
that often.  So the question is how quickly can you *dirty* 1.6GB of
memory?

"dd if=/dev/zero of=/u1/dirty-me-harder" will certainly do it, but
normally we're doing something useful, and so you're either copying
data from local disk, at which point you're limited by the read speed
of your local disk (I suppose it could be in cache, but how common of
a case is that?), *or*, you're copying from the network, and to copy
in 1.6GB of data in 5 seconds, that means you're moving 320
megabytes/second, which if we're copying in the data from the network,
requires a 10 gigabit ethernet.
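
As a quick check of that estimate, here is a small sketch that turns
"this much data in this many seconds" into a sustained network rate.
The function names are hypothetical, 1.6GB is taken as 1600 megabytes,
and protocol overhead is ignored, so real links need some extra
headroom.

```c
#include <stddef.h>

/* Sustained rate in megabits/second needed to move `megabytes` in `seconds`. */
static double required_mbit_per_sec(double megabytes, double seconds)
{
	return megabytes / seconds * 8.0;
}

/*
 * Smallest common Ethernet speed (in Mbit/s) that can carry
 * `mbit_per_sec`; returns 0 if even 10 gigabit is too slow.
 */
static int smallest_ethernet_mbit(double mbit_per_sec)
{
	static const int speeds[] = { 10, 100, 1000, 10000 };
	size_t i;

	for (i = 0; i < sizeof(speeds) / sizeof(speeds[0]); i++)
		if (mbit_per_sec <= speeds[i])
			return speeds[i];
	return 0;
}
```

required_mbit_per_sec(1600.0, 5.0) comes to 2560 Mbit/s, i.e. 320
megabytes/second, which is more than gigabit ethernet can carry, so
smallest_ethernet_mbit() picks the 10000 Mbit/s entry.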

Hence my statement that this probably became much more visible with
fast ethernets --- but you're right, the huge increase in memory sizes
was also a key factor; otherwise, write throttling would have kicked
in and the VM would have started pushing the dirty pages to disk much
sooner.

						- Ted


* Re: Linux 2.6.29
  2009-03-24 13:20             ` Theodore Tso
  2009-03-24 13:30               ` Ingo Molnar
@ 2009-03-24 13:52               ` Alan Cox
  2009-03-24 14:28                 ` Theodore Tso
  2009-03-24 17:55                 ` Jan Kara
  2009-03-24 17:55               ` Linus Torvalds
                                 ` (2 subsequent siblings)
  4 siblings, 2 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-24 13:52 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

> They don't solve the problem where there is a *huge* amount of writes
> going on, though --- if something is dirtying pages at a rate far

At very high rates other things seem to go pear shaped. I've not traced
it back far enough to be sure but what I suspect occurs from the I/O at
disk level is that two people are writing stuff out at once - presumably
the vm paging pressure and the file system - as I see two streams of I/O
that are each reasonably ordered but are interleaved.

> don't get *that* bad, even with ext3.  At least, I haven't found a
> workload that doesn't involve either dd if=/dev/zero or a massive
> amount of data coming in over the network that will cause fsync()
> delays in the > 1-2 second category.  Ext3 has been around for a long

I see it with a desktop when it pages hard and also when doing heavy
desktop I/O (in my case the repeatable every time case is saving large
images in the gimp - A4 at 600-1200dpi).

The other one (#8636) seems to be a bug in the I/O schedulers as it goes
away if you use a different I/O sched.

> solve.  Simply mounting an ext3 filesystem using ext4, without making
> any change to the filesystem format, should solve the problem.

I will try this experiment but not with production data just yet 8)

> some other users' data files.  This was the reason for Stephen Tweedie
> implementing the data=ordered mode, and making it the default.

Yes and in the server environment or for typical enterprise customers
this is a *big issue*, especially the risk of it being undetected that
they just inadvertently did something like put your medical data into the
end of something public during a crash.

> Try ext4, I think you'll like it.  :-)

I need to, so that I can double check none of the open jbd locking bugs
are there and close more bugzilla entries (#8147)

Thanks for the reply - I hadn't realised a lot of this was getting fixed
but in ext4 and quietly

Alan


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 13:35   ` Herbert Xu
@ 2009-03-24 14:06     ` Ingo Molnar
  0 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 14:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner,
	Peter Zijlstra, Linux Kernel Mailing List


* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> > 
> > Yesterday about half of my testboxes (3 out of 7) started getting 
> > weird networking failures: their network interface just got stuck 
> > completely - no rx and no tx at all. Restarting the interface did 
> > not help.
> 
> Darn, does this patch help?
> 
> net: Fix netpoll lockup in legacy receive path

thanks, added it to the test mix. Should know the result in 1-2 
hours of test time.

	Ingo


* Re: Linux 2.6.29
  2009-03-24 13:52               ` Alan Cox
@ 2009-03-24 14:28                 ` Theodore Tso
  2009-03-24 15:18                   ` Alan Cox
  2009-03-24 17:55                 ` Jan Kara
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-24 14:28 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 01:52:49PM +0000, Alan Cox wrote:
> 
> At very high rates other things seem to go pear shaped. I've not traced
> it back far enough to be sure but what I suspect occurs from the I/O at
> disk level is that two people are writing stuff out at once - presumably
> the vm paging pressure and the file system - as I see two streams of I/O
> that are each reasonably ordered but are interleaved.

Surely the elevator should have reordered the writes reasonably?  (Or
is that what you meant by "the other one -- #8636 (I assume this is a
kernel Bugzilla #?) seems to be a bug in the I/O schedulers as it goes
away if you use a different I/O sched.?")

> > don't get *that* bad, even with ext3.  At least, I haven't found a
> > workload that doesn't involve either dd if=/dev/zero or a massive
> > amount of data coming in over the network that will cause fsync()
> > delays in the > 1-2 second category.  Ext3 has been around for a long
> 
> I see it with a desktop when it pages hard and also when doing heavy
> desktop I/O (in my case the repeatable every time case is saving large
> images in the gimp - A4 at 600-1200dpi).

Yeah, I could see that doing it.  How big is the image, and out of
curiosity, can you run the fsync-tester.c program I posted while
saving the gimp image, and tell me how much of a delay you end up
seeing?

> > solve.  Simply mounting an ext3 filesystem using ext4, without making
> > any change to the filesystem format, should solve the problem.
> 
> I will try this experiment but not with production data just yet 8)

Where's your bravery, man?  :-)

I've been using it on my laptop since July, and haven't lost
significant amounts of data yet.  (The only thing I did lose was bits
of a git repository fairly early on, and I was able to repair by
replacing the missing objects.)

> > some other users' data files.  This was the reason for Stephen Tweedie
> > implementing the data=ordered mode, and making it the default.
> 
> Yes and in the server environment or for typical enterprise customers
> this is a *big issue*, especially the risk of it being undetected that
> they just inadvertently did something like put your medical data into the
> end of something public during a crash.

True enough; changing the defaults to be data=writeback for the server
environment is probably not a good idea.  (Then again, in the server
environment most of the workloads generally don't end up hitting the
nasty data=ordered failure modes; they tend to be
transaction-oriented, and fsync().)

> > Try ext4, I think you'll like it.  :-)
> 
> I need to, so that I can double check none of the open jbd locking bugs
> are there and close more bugzilla entries (#8147)

More testing would be appreciated --- and yeah, we need to groom the
bugzilla.  For a long time no one in ext3 land was paying attention to
bugzilla, and more recently I've been trying to keep up with the
ext4-related bugs, but I don't get to do ext4 work full-time, and
occasionally Stacey gets annoyed at me when I work late into the night...

> Thanks for the reply - I hadn't realised a lot of this was getting fixed
> but in ext4 and quietly

Yeah, there are a bunch of things, like the barrier=1 default, which
akpm has rejected for ext3, but which we've fixed in ext4.  More help
in shaking down the bugs would definitely be appreciated.

   	   	    	       		     - Ted


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 13:02 ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Ingo Molnar
  2009-03-24 13:12     ` Ingo Molnar
  2009-03-24 13:35   ` Herbert Xu
@ 2009-03-24 14:33   ` Robert Schwebel
  2009-03-24 14:39     ` Ingo Molnar
  2 siblings, 1 reply; 664+ messages in thread
From: Robert Schwebel @ 2009-03-24 14:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel

On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> If the box hung within 15 minutes, the kernel was deemed bad. Using 
> that method i arrived to this upstream networking fix which was 
> merged yesterday:
> 
>  | 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
>  | commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
>  | Author: Herbert Xu <herbert@gondor.apana.org.au>
>  | Date:   Tue Mar 17 13:11:29 2009 -0700
>  |
>  |     gro: Fix legacy path napi_complete crash

This commit breaks nfsroot booting on i.MX27 and other ARM boxes with
different network cards here in a reproducible way.

rsc
-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 14:33   ` Robert Schwebel
@ 2009-03-24 14:39     ` Ingo Molnar
  2009-03-24 15:09       ` Herbert Xu
  2009-03-24 15:22       ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Sascha Hauer
  0 siblings, 2 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 14:39 UTC (permalink / raw)
  To: Robert Schwebel
  Cc: Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel


* Robert Schwebel <r.schwebel@pengutronix.de> wrote:

> On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> > If the box hung within 15 minutes, the kernel was deemed bad. Using 
> > that method i arrived to this upstream networking fix which was 
> > merged yesterday:
> > 
> >  | 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
> >  | commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
> >  | Author: Herbert Xu <herbert@gondor.apana.org.au>
> >  | Date:   Tue Mar 17 13:11:29 2009 -0700
> >  |
> >  |     gro: Fix legacy path napi_complete crash
> 
> This commit breaks nfsroot booting on i.MX27 and other ARM boxes 
> with different network cards here in a reproducible way.

Can you confirm that Herbert's fix (see it below) solves the 
problem?

	Ingo

--------------->
From b8b66ac07cab1b45aac93e4f406833a1e0d7677e Mon Sep 17 00:00:00 2001
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 24 Mar 2009 21:35:42 +0800
Subject: [PATCH] net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete.  Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

While this is fishy in itself, let's make the obvious fix right
now of reverting to the previous state where we always called
__napi_complete.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090324133542.GA29046@gondor.apana.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 net/core/dev.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..523f53e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2580,24 +2580,26 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	unsigned long start_time = jiffies;
+	struct sk_buff *skb;
 
 	napi->weight = weight_p;
 	do {
-		struct sk_buff *skb;
-
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
-		if (!skb) {
-			local_irq_enable();
-			napi_complete(napi);
-			goto out;
-		}
 		local_irq_enable();
+		if (!skb)
+			break;
 
 		napi_gro_receive(napi, skb);
 	} while (++work < quota && jiffies == start_time);
 
 	napi_gro_flush(napi);
+	if (skb)
+		goto out;
+
+	local_irq_disable();
+	__napi_complete(napi);
+	local_irq_enable();
 
 out:
 	return work;


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 14:39     ` Ingo Molnar
@ 2009-03-24 15:09       ` Herbert Xu
  2009-03-24 15:29         ` Sascha Hauer
                           ` (2 more replies)
  2009-03-24 15:22       ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Sascha Hauer
  1 sibling, 3 replies; 664+ messages in thread
From: Herbert Xu @ 2009-03-24 15:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel

On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
>
> Subject: [PATCH] net: Fix netpoll lockup in legacy receive path

Actually, this patch is still racy.  If some interrupt comes in
and we suddenly get the maximum amount of backlog we can still
hang when we call __napi_complete incorrectly.  It's unlikely
but we certainly shouldn't allow that.  Here's a better version.

net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete.  Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.

This patch fixes this by essentially open-coding __napi_complete.

Note we no longer need the memory barrier because this function
is per-cpu.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..2a7f6b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			list_del(&napi->poll_list);
+			clear_bit(NAPI_STATE_SCHED, &napi->state);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
@@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 
 	napi_gro_flush(napi);
 
-out:
 	return work;
 }

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Linux 2.6.29
  2009-03-24 14:28                 ` Theodore Tso
@ 2009-03-24 15:18                   ` Alan Cox
  0 siblings, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-24 15:18 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

> Surely the elevator should have reordered the writes reasonably?  (Or
> is that what you meant by "the other one -- #8636 (I assume this is a
> kernel Bugzilla #?) seems to be a bug in the I/O schedulers as it goes
> away if you use a different I/O sched.?")

There are two cases there. One is a bug #8636 (kernel bugzilla) which is
where things like dump show awful performance with certain I/O scheduler
settings. That seems to be totally not connected to the fs but it is a
problem (and has a patch)

In the second one the elevator is clearly trying to sort things out, but
it's behaving as if someone is writing the file starting at, say, 0 and
someone else is trying to write it back starting some large distance
further down the file. The elevator can only do so much then.

> Yeah, I could see that doing it.  How big is the image, and out of
> curiosity, can you run the fsync-tester.c program I posted while

150MB+ for the pnm files from gimp used as temporaries by Eve (Etch
Validation Engine), more like 10MB for xcf/tif files.

> saving the gimp image, and tell me how much of a delay you end up
> seeing?

Added to the TODO list once I can set up a suitable test box (my new dev
box is somewhere between Dell and my desk right now)

> More testing would be appreciated --- and yeah, we need to groom the
> bugzilla.  

I'm currently doing this on a large scale (closed about 300 so far this
run). Bug 8147 might be worth a look, as it's a case where the jbd locking
and the jbd comments seem to disagree (the comments say you must hold a
lock but we don't seem to).


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 14:39     ` Ingo Molnar
  2009-03-24 15:09       ` Herbert Xu
@ 2009-03-24 15:22       ` Sascha Hauer
  1 sibling, 0 replies; 664+ messages in thread
From: Sascha Hauer @ 2009-03-24 15:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Schwebel, Linus Torvalds, Herbert Xu, Frank Blaschka,
	David S. Miller, Thomas Gleixner, Peter Zijlstra,
	Linux Kernel Mailing List, kernel

Hi Ingo,

On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> 
> * Robert Schwebel <r.schwebel@pengutronix.de> wrote:
> 
> > On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> > > If the box hung within 15 minutes, the kernel was deemed bad. Using 
> > > that method i arrived to this upstream networking fix which was 
> > > merged yesterday:
> > > 
> > >  | 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
> > >  | commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
> > >  | Author: Herbert Xu <herbert@gondor.apana.org.au>
> > >  | Date:   Tue Mar 17 13:11:29 2009 -0700
> > >  |
> > >  |     gro: Fix legacy path napi_complete crash
> > 
> > This commit breaks nfsroot booting on i.MX27 and other ARM boxes 
> > > with different network cards here in a reproducible way.
> 
> Can you confirm that Herbert's fix (see it below) solves the 
> problem?

No, still doesn't work.

It seems to have something to do with enabling interrupts between
__skb_dequeue() and __napi_complete().

I reverted 303c6a0251852ecbdc5c15e466dcaff5971f7517 and added a

local_irq_enable(); local_irq_disable();

right before __napi_complete() and this already breaks networking.
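
One way to picture why such a window could matter is the toy userspace
model below.  It is only a sketch and not kernel code: all names are
hypothetical, and it assumes that the receive interrupt schedules the
poll only when it finds the queue empty, and that scheduling is a no-op
while the "scheduled" bit is still set.  Neither detail is spelled out
in this thread, so both should be treated as assumptions.

```c
#include <stdbool.h>

/* Toy stand-in for the per-CPU backlog; not a kernel structure. */
struct toy_backlog {
	int  qlen;	/* packets waiting in the input queue */
	bool scheduled;	/* models the "poll is scheduled" state bit */
};

/* Models a receive interrupt queueing one packet. */
static void toy_rx_irq(struct toy_backlog *b)
{
	/* Assumption: only an empty queue triggers scheduling. */
	if (b->qlen == 0)
		b->scheduled = true;	/* no effect if already set */
	b->qlen++;
}

/*
 * Models the end of a poll pass that has just seen an empty queue.
 * `irq_in_window` says whether an interrupt fires between that
 * check and the completion step.
 */
static void toy_poll_finish(struct toy_backlog *b, bool irq_in_window)
{
	if (irq_in_window)
		toy_rx_irq(b);	/* the enable/disable pair opens this window */
	b->scheduled = false;	/* completion clears the scheduled bit */
}

/* True if packets are queued but nothing is scheduled to process them. */
static bool toy_stuck(const struct toy_backlog *b)
{
	return b->qlen > 0 && !b->scheduled;
}
```

In this model, starting from an empty, scheduled backlog,
toy_poll_finish(&b, false) leaves things consistent, while
toy_poll_finish(&b, true) leaves one packet queued with nothing
scheduled, and every later toy_rx_irq() just adds to the queue without
ever scheduling the poll again.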


Sascha

> 
> 	Ingo
> 
> --------------->
> From b8b66ac07cab1b45aac93e4f406833a1e0d7677e Mon Sep 17 00:00:00 2001
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Tue, 24 Mar 2009 21:35:42 +0800
> Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
> 
> When I fixed the GRO crash in the legacy receive path I used
> napi_complete to replace __napi_complete.  Unfortunately they're
> not the same when NETPOLL is enabled, which may result in us
> not calling __napi_complete at all.
> 
> While this is fishy in itself, let's make the obvious fix right
> now of reverting to the previous state where we always called
> __napi_complete.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <20090324133542.GA29046@gondor.apana.org.au>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  net/core/dev.c |   16 +++++++++-------
>  1 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e3fe5c7..523f53e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2580,24 +2580,26 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  	int work = 0;
>  	struct softnet_data *queue = &__get_cpu_var(softnet_data);
>  	unsigned long start_time = jiffies;
> +	struct sk_buff *skb;
>  
>  	napi->weight = weight_p;
>  	do {
> -		struct sk_buff *skb;
> -
>  		local_irq_disable();
>  		skb = __skb_dequeue(&queue->input_pkt_queue);
> -		if (!skb) {
> -			local_irq_enable();
> -			napi_complete(napi);
> -			goto out;
> -		}
>  		local_irq_enable();
> +		if (!skb)
> +			break;
>  
>  		napi_gro_receive(napi, skb);
>  	} while (++work < quota && jiffies == start_time);
>  
>  	napi_gro_flush(napi);
> +	if (skb)
> +		goto out;
> +
> +	local_irq_disable();
> +	__napi_complete(napi);
> +	local_irq_enable();
>  
>  out:
>  	return work;
> 

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 15:09       ` Herbert Xu
@ 2009-03-24 15:29         ` Sascha Hauer
  2009-03-24 15:36         ` Ingo Molnar
  2009-03-24 21:36         ` David Miller
  2 siblings, 0 replies; 664+ messages in thread
From: Sascha Hauer @ 2009-03-24 15:29 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ingo Molnar, Robert Schwebel, Linus Torvalds, Frank Blaschka,
	David S. Miller, Thomas Gleixner, Peter Zijlstra,
	Linux Kernel Mailing List, kernel

On Tue, Mar 24, 2009 at 11:09:28PM +0800, Herbert Xu wrote:
> On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> >
> > Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
> 
> Actually, this patch is still racy.  If some interrupt comes in
> and we suddenly get the maximum amount of backlog we can still
> hang when we call __napi_complete incorrectly.  It's unlikely
> but we certainly shouldn't allow that.  Here's a better version.

Yes, this one works. I always wanted to give a

Tested-by: Sascha Hauer <s.hauer@pengutronix.de>

Thanks
  Sascha

> 
> net: Fix netpoll lockup in legacy receive path
> 
> When I fixed the GRO crash in the legacy receive path I used
> napi_complete to replace __napi_complete.  Unfortunately they're
> not the same when NETPOLL is enabled, which may result in us
> not calling __napi_complete at all.
> 
> What's more, we really do need to keep the __napi_complete call
> within the IRQ-off section since in theory an IRQ can occur in
> between and fill up the backlog to the maximum, causing us to
> lock up.
> 
> This patch fixes this by essentially open-coding __napi_complete.
> 
> Note we no longer need the memory barrier because this function
> is per-cpu.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e3fe5c7..2a7f6b3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  		local_irq_disable();
>  		skb = __skb_dequeue(&queue->input_pkt_queue);
>  		if (!skb) {
> +			list_del(&napi->poll_list);
> +			clear_bit(NAPI_STATE_SCHED, &napi->state);
>  			local_irq_enable();
> -			napi_complete(napi);
> -			goto out;
> +			break;
>  		}
>  		local_irq_enable();
>  
> @@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  
>  	napi_gro_flush(napi);
>  
> -out:
>  	return work;
>  }
> 
> Thanks,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> 
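
The ordering in the quoted patch can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel implementation: all names (model_cpu, model_raise_rx_irq, model_poll_tail_fixed and the helpers) are hypothetical, interrupt masking is reduced to a flag plus a pending counter, and the backlog limit is left out. It only illustrates the point of the commit message: when the emptiness check and the completion share one IRQ-off section, an interrupt raised in between is deferred until after the completion and then schedules the poller again.

```c
#include <assert.h>
#include <stdbool.h>

struct model_cpu {
	int qlen;          /* packets in the input queue */
	bool scheduled;    /* stands in for NAPI_STATE_SCHED */
	bool irqs_off;     /* true inside the modelled irq-off section */
	int pending_irqs;  /* receive interrupts raised while masked */
};

/* Body of the receive interrupt: schedule only if the queue was empty. */
static void model_deliver(struct model_cpu *c)
{
	if (c->qlen == 0)
		c->scheduled = true;
	c->qlen++;
}

/* Hardware raises a receive interrupt; it is deferred while masked. */
void model_raise_rx_irq(struct model_cpu *c)
{
	if (c->irqs_off)
		c->pending_irqs++;
	else
		model_deliver(c);
}

static void model_irq_disable(struct model_cpu *c)
{
	c->irqs_off = true;
}

/* Unmask and run every interrupt that was deferred meanwhile. */
static void model_irq_enable(struct model_cpu *c)
{
	c->irqs_off = false;
	while (c->pending_irqs > 0) {
		c->pending_irqs--;
		model_deliver(c);
	}
}

/*
 * Poll tail in the style of the patch.  raise_during_section injects an
 * interrupt at the worst moment, right after the emptiness check.
 * Returns true if the poller completed, false if work was still queued.
 */
bool model_poll_tail_fixed(struct model_cpu *c, bool raise_during_section)
{
	model_irq_disable(c);
	if (c->qlen != 0) {
		model_irq_enable(c);
		return false;
	}
	if (raise_during_section)
		model_raise_rx_irq(c);  /* deferred: irqs are off */
	c->scheduled = false;           /* open-coded completion */
	model_irq_enable(c);            /* deferred irq runs and reschedules */
	return true;
}
```

With raise_during_section set, the model ends with one queued packet and the scheduled flag set again, so the packet is not stranded.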

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 15:09       ` Herbert Xu
  2009-03-24 15:29         ` Sascha Hauer
@ 2009-03-24 15:36         ` Ingo Molnar
  2009-03-24 15:47           ` Ingo Molnar
  2009-03-24 21:36         ` David Miller
  2 siblings, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 15:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel


* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> >
> > Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
> 
> Actually, this patch is still racy.  If some interrupt comes in
> and we suddenly get the maximum amount of backlog we can still
> hang when we call __napi_complete incorrectly.  It's unlikely
> but we certainly shouldn't allow that.  Here's a better version.
> 
> net: Fix netpoll lockup in legacy receive path

ok - i'm testing with this now.

	Ingo


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 15:36         ` Ingo Molnar
@ 2009-03-24 15:47           ` Ingo Molnar
  2009-03-24 15:59             ` Herbert Xu
  0 siblings, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 15:47 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Herbert Xu <herbert@gondor.apana.org.au> wrote:
> 
> > On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> > >
> > > Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
> > 
> > Actually, this patch is still racy.  If some interrupt comes in
> > and we suddenly get the maximum amount of backlog we can still
> > hang when we call __napi_complete incorrectly.  It's unlikely
> > but we certainly shouldn't allow that.  Here's a better version.
> > 
> > net: Fix netpoll lockup in legacy receive path
> 
> ok - i'm testing with this now.

test failure on one of the boxes, interface got stuck after ~100K 
packets:

eth1      Link encap:Ethernet  HWaddr 00:13:D4:DC:41:12  
          inet addr:10.0.1.13  Bcast:10.0.1.255  Mask:255.255.255.0
          inet6 addr: fe80::213:d4ff:fedc:4112/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:22555 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1897 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2435071 (2.3 MiB)  TX bytes:503790 (491.9 KiB)
          Interrupt:11 Base address:0x4000 

i'm going back to your previous version for now - it might still be 
racy but it worked well for about 1.5 hours of test-time.

	Ingo


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 15:47           ` Ingo Molnar
@ 2009-03-24 15:59             ` Herbert Xu
  2009-03-24 16:02               ` Ingo Molnar
  0 siblings, 1 reply; 664+ messages in thread
From: Herbert Xu @ 2009-03-24 15:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel

On Tue, Mar 24, 2009 at 04:47:17PM +0100, Ingo Molnar wrote:
> 
> test failure on one of the boxes, interface got stuck after ~100K 
> packets:
> 
> eth1      Link encap:Ethernet  HWaddr 00:13:D4:DC:41:12  
>           inet addr:10.0.1.13  Bcast:10.0.1.255  Mask:255.255.255.0
>           inet6 addr: fe80::213:d4ff:fedc:4112/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:22555 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1897 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000 
>           RX bytes:2435071 (2.3 MiB)  TX bytes:503790 (491.9 KiB)
>           Interrupt:11 Base address:0x4000 

What's the NIC and config on this one? If it's still using the
legacy/netif_rx path, where GRO is off by default, this patch
should make it exactly the same as with my original patch reverted.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 15:59             ` Herbert Xu
@ 2009-03-24 16:02               ` Ingo Molnar
  2009-03-24 19:19                 ` Ingo Molnar
  2009-03-25  0:32                 ` Herbert Xu
  0 siblings, 2 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 16:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel

[-- Attachment #1: Type: text/plain, Size: 1145 bytes --]


* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Tue, Mar 24, 2009 at 04:47:17PM +0100, Ingo Molnar wrote:
> > 
> > test failure on one of the boxes, interface got stuck after ~100K 
> > packets:
> > 
> > eth1      Link encap:Ethernet  HWaddr 00:13:D4:DC:41:12  
> >           inet addr:10.0.1.13  Bcast:10.0.1.255  Mask:255.255.255.0
> >           inet6 addr: fe80::213:d4ff:fedc:4112/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:22555 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:1897 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000 
> >           RX bytes:2435071 (2.3 MiB)  TX bytes:503790 (491.9 KiB)
> >           Interrupt:11 Base address:0x4000 
> 
> What's the NIC and config on this one? If it's still using the 
> legacy/netif_rx path, where GRO is off by default, this patch 
> should make it exactly the same as with my original patch 
> reverted.

Same forcedeth box i reported before. Config below. (note: if you 
want to use it you need to run it through 'make oldconfig', with all 
defaults accepted)

	Ingo

[-- Attachment #2: config-Tue_Mar_24_15_24_33_CET_2009.bad --]
[-- Type: text/plain, Size: 65002 bytes --]

# ac10134a
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.29
# Tue Mar 24 15:24:33 2009
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_X86_32_LAZY_GS=y
CONFIG_KTIME_SCALAR=y
# CONFIG_BOOTPARAM_SUPPORT_NOT_WANTED is not set
CONFIG_BOOTPARAM_SUPPORT=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_BOOT_ALLOWED4=y
# CONFIG_BROKEN_BOOT_ALLOWED3 is not set
CONFIG_BROKEN_BOOT_EUROPE=y
CONFIG_BROKEN_BOOT_TITAN=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
# CONFIG_KERNEL_GZIP is not set
CONFIG_KERNEL_BZIP2=y
# CONFIG_KERNEL_LZMA is not set
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
CONFIG_AUDIT=y
# CONFIG_AUDITSYSCALL is not set

#
# RCU Subsystem
#
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=20
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
# CONFIG_CGROUP_NS is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_DEVICE is not set
# CONFIG_CPUSETS is not set
CONFIG_CGROUP_CPUACCT=y
# CONFIG_RESOURCE_COUNTERS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
# CONFIG_RELAY is not set
# CONFIG_NAMESPACES is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
CONFIG_INITRAMFS_COMPRESSION_NONE=y
# CONFIG_INITRAMFS_COMPRESSION_GZIP is not set
# CONFIG_INITRAMFS_COMPRESSION_BZIP2 is not set
# CONFIG_INITRAMFS_COMPRESSION_LZMA is not set
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_EMBEDDED=y
# CONFIG_UID16 is not set
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
# CONFIG_BUG is not set
CONFIG_ELF_CORE=y
# CONFIG_PCSPKR_PLATFORM is not set
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
# CONFIG_TIMERFD is not set
# CONFIG_EVENTFD is not set
CONFIG_SHMEM=y
# CONFIG_AIO is not set
CONFIG_HAVE_PERF_COUNTERS=y

#
# Performance Counters
#
CONFIG_PERF_COUNTERS=y
# CONFIG_EVENT_PROFILE is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
# CONFIG_SLUB_DEBUG is not set
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y
CONFIG_OPROFILE=m
CONFIG_OPROFILE_IBS=y
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_LBD is not set
CONFIG_BLK_DEV_BSG=y
# CONFIG_BLK_DEV_INTEGRITY is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=m
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
CONFIG_DEFAULT_NOOP=y
CONFIG_DEFAULT_IOSCHED="noop"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_BOOTPARAM_NO_HZ_OFF=y
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_BOOTPARAM_HIGHRES_OFF=y
CONFIG_SMP_SUPPORT=y
CONFIG_SPARSE_IRQ=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_BIGSMP=y
# CONFIG_X86_EXTENDED_PLATFORM is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_PARAVIRT_GUEST is not set
CONFIG_MEMTEST=y
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_GENERIC=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PROCESSOR_SELECT=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_CYRIX_32=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
# CONFIG_CPU_SUP_TRANSMETA_32 is not set
# CONFIG_CPU_SUP_UMC_32 is not set
# CONFIG_X86_DS is not set
# CONFIG_X86_PTRACE_BTS is not set
# CONFIG_HPET_TIMER is not set
CONFIG_DMI=y
# CONFIG_IOMMU_HELPER is not set
# CONFIG_IOMMU_API is not set
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
# CONFIG_SCHED_MC is not set
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
# CONFIG_X86_MCE is not set
CONFIG_VM86=y
CONFIG_I8K=y
CONFIG_X86_REBOOTFIXUPS=y
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_X86_CPU_DEBUG=m
# CONFIG_UP_WANTED_1 is not set
CONFIG_SMP=y
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
# CONFIG_VMSPLIT_3G is not set
# CONFIG_VMSPLIT_3G_OPT is not set
CONFIG_VMSPLIT_2G=y
# CONFIG_VMSPLIT_2G_OPT is not set
# CONFIG_VMSPLIT_1G is not set
CONFIG_PAGE_OFFSET=0x80000000
CONFIG_HIGHMEM=y
# CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set
CONFIG_NEED_NODE_MEMMAP_SIZE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
# CONFIG_ARCH_MEMORY_PROBE is not set
CONFIG_ILLEGAL_POINTER_VALUE=0
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MIGRATION=y
# CONFIG_PHYS_ADDR_T_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
# CONFIG_UNEVICTABLE_LRU is not set
CONFIG_MMU_NOTIFIER=y
# CONFIG_HIGHPTE is not set
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MATH_EMULATION=y
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
# CONFIG_X86_PAT is not set
# CONFIG_SECCOMP is not set
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300
# CONFIG_SCHED_HRTICK is not set
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PHYSICAL_START=0x100000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE=""
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y

#
# Power management and ACPI options
#
# CONFIG_PM is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
# CONFIG_X86_POWERNOW_K6 is not set
CONFIG_X86_POWERNOW_K7=m
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_GX_SUSPMOD=m
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_SPEEDSTEP_ICH=m
CONFIG_X86_SPEEDSTEP_SMI=y
# CONFIG_X86_P4_CLOCKMOD is not set
CONFIG_X86_CPUFREQ_NFORCE2=m
CONFIG_X86_LONGRUN=y
CONFIG_X86_E_POWERSAVER=y

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=y
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set
# CONFIG_CPU_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
CONFIG_PCI_GOMMCONFIG=y
# CONFIG_PCI_GODIRECT is not set
# CONFIG_PCI_GOOLPC is not set
# CONFIG_PCI_GOANY is not set
CONFIG_PCI_BIOS=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEBUG=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_LEGACY is not set
# CONFIG_PCI_STUB is not set
# CONFIG_HT_IRQ is not set
CONFIG_ISA_DMA_API=y
# CONFIG_ISA is not set
CONFIG_MCA=y
CONFIG_MCA_LEGACY=y
CONFIG_MCA_PROC_FS=y
CONFIG_SCx200=y
CONFIG_SCx200HR_TIMER=y
# CONFIG_OLPC is not set
CONFIG_K8_NB=y
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
# CONFIG_CARDBUS is not set

#
# PC-card bridges
#
CONFIG_YENTA=y
# CONFIG_YENTA_O2 is not set
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
# CONFIG_YENTA_TOSHIBA is not set
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
CONFIG_PCCARD_NONSTATIC=y
# CONFIG_HOTPLUG_PCI is not set
# CONFIG_BOOTPARAM_NMI_WATCHDOG_BIT_0 is not set
CONFIG_BOOTPARAM_NOLAPIC_TIMER=y
CONFIG_BOOTPARAM_LAPIC=y
CONFIG_BOOTPARAM_HPET_DISABLE=y
CONFIG_BOOTPARAM_IDLE_MWAIT=y
CONFIG_BOOTPARAM_IDLE_POLL=y
# CONFIG_BOOTPARAM_HIGHMEM_512M is not set
# CONFIG_BOOTPARAM_NOPAT is not set
CONFIG_BOOTPARAM_NOTSC=y
# CONFIG_BOOTPARAM_ACPI_OFF is not set
# CONFIG_BOOTPARAM_PCI_NOMSI is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_HAVE_AOUT=y
CONFIG_BINFMT_AOUT=m
# CONFIG_BINFMT_MISC is not set
CONFIG_HAVE_ATOMIC_IOMAP=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_COMPAT_NET_DEV_OPS=y
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_IPCOMP=y
CONFIG_NET_KEY=m
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
# CONFIG_IP_ROUTE_MULTIPATH is not set
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=y
CONFIG_TCP_CONG_VENO=m
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
# CONFIG_DEFAULT_BIC is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=y
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=y
CONFIG_IPV6_MIP6=y
CONFIG_INET6_XFRM_TUNNEL=y
CONFIG_INET6_TUNNEL=y
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
CONFIG_INET6_XFRM_MODE_TUNNEL=y
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
CONFIG_IPV6_TUNNEL=y
# CONFIG_IPV6_MULTIPLE_TABLES is not set
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_PIMSM_V2=y
CONFIG_NETLABEL=y
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
CONFIG_IP_DCCP=y
CONFIG_INET_DCCP_DIAG=y

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
CONFIG_IP_DCCP_CCID2_DEBUG=y
CONFIG_IP_DCCP_CCID3=y
CONFIG_IP_DCCP_CCID3_DEBUG=y
CONFIG_IP_DCCP_CCID3_RTO=100
CONFIG_IP_DCCP_TFRC_LIB=y
CONFIG_IP_DCCP_TFRC_DEBUG=y
# CONFIG_IP_SCTP is not set
CONFIG_TIPC=m
CONFIG_TIPC_ADVANCED=y
CONFIG_TIPC_ZONES=3
CONFIG_TIPC_CLUSTERS=1
CONFIG_TIPC_NODES=255
CONFIG_TIPC_SLAVE_NODES=0
CONFIG_TIPC_PORTS=8191
CONFIG_TIPC_LOG=0
# CONFIG_TIPC_DEBUG is not set
CONFIG_ATM=y
# CONFIG_ATM_CLIP is not set
# CONFIG_ATM_LANE is not set
# CONFIG_ATM_BR2684 is not set
CONFIG_STP=y
CONFIG_GARP=y
CONFIG_BRIDGE=y
CONFIG_NET_DSA=y
CONFIG_NET_DSA_TAG_DSA=y
CONFIG_NET_DSA_TAG_EDSA=y
# CONFIG_NET_DSA_TAG_TRAILER is not set
CONFIG_NET_DSA_MV88E6XXX=y
# CONFIG_NET_DSA_MV88E6060 is not set
CONFIG_NET_DSA_MV88E6XXX_NEED_PPU=y
CONFIG_NET_DSA_MV88E6131=y
CONFIG_NET_DSA_MV88E6123_61_65=y
CONFIG_VLAN_8021Q=y
CONFIG_VLAN_8021Q_GVRP=y
CONFIG_DECNET=m
# CONFIG_DECNET_ROUTER is not set
CONFIG_LLC=y
CONFIG_LLC2=m
# CONFIG_IPX is not set
CONFIG_ATALK=y
CONFIG_DEV_APPLETALK=m
CONFIG_IPDDP=m
CONFIG_IPDDP_ENCAP=y
CONFIG_IPDDP_DECAP=y
CONFIG_X25=m
CONFIG_LAPB=y
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=y
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
CONFIG_NET_SCH_RED=y
CONFIG_NET_SCH_SFQ=y
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=y
# CONFIG_NET_SCH_NETEM is not set
CONFIG_NET_SCH_DRR=y
CONFIG_NET_SCH_INGRESS=y

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=y
CONFIG_NET_CLS_TCINDEX=m
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
CONFIG_NET_CLS_U32=y
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=y
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_CLS_FLOW=m
CONFIG_NET_CLS_CGROUP=y
# CONFIG_NET_EMATCH is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_NAT=y
CONFIG_NET_ACT_PEDIT=m
# CONFIG_NET_ACT_SIMP is not set
CONFIG_NET_ACT_SKBEDIT=m
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
CONFIG_CAN=m
CONFIG_CAN_RAW=m
# CONFIG_CAN_BCM is not set

#
# CAN Device Drivers
#
# CONFIG_CAN_VCAN is not set
CONFIG_CAN_DEBUG_DEVICES=y
# CONFIG_IRDA is not set
CONFIG_BT=y
# CONFIG_BT_L2CAP is not set
# CONFIG_BT_SCO is not set

#
# Bluetooth device drivers
#
# CONFIG_BT_HCIBTUSB is not set
CONFIG_BT_HCIBTSDIO=m
CONFIG_BT_HCIUART=y
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
CONFIG_BT_HCIUART_LL=y
# CONFIG_BT_HCIBCM203X is not set
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIDTL1=m
CONFIG_BT_HCIBT3C=y
# CONFIG_BT_HCIBLUECARD is not set
CONFIG_BT_HCIBTUART=m
# CONFIG_BT_HCIVHCI is not set
CONFIG_AF_RXRPC=y
# CONFIG_AF_RXRPC_DEBUG is not set
CONFIG_RXKAD=y
CONFIG_PHONET=m
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
CONFIG_CFG80211=m
CONFIG_CFG80211_REG_DEBUG=y
# CONFIG_NL80211 is not set
CONFIG_WIRELESS_OLD_REGULATORY=y
CONFIG_WIRELESS_EXT=y
CONFIG_WIRELESS_EXT_SYSFS=y
CONFIG_LIB80211=m
CONFIG_LIB80211_DEBUG=y
CONFIG_MAC80211=m

#
# Rate control algorithm selection
#
# CONFIG_MAC80211_RC_PID is not set
CONFIG_MAC80211_RC_MINSTREL=y
# CONFIG_MAC80211_RC_DEFAULT_PID is not set
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel"
# CONFIG_MAC80211_MESH is not set
CONFIG_MAC80211_LEDS=y
CONFIG_MAC80211_DEBUGFS=y
CONFIG_MAC80211_DEBUG_MENU=y
# CONFIG_MAC80211_DEBUG_PACKET_ALIGNMENT is not set
# CONFIG_MAC80211_NOINLINE is not set
CONFIG_MAC80211_VERBOSE_DEBUG=y
# CONFIG_MAC80211_HT_DEBUG is not set
CONFIG_MAC80211_TKIP_DEBUG=y
# CONFIG_MAC80211_IBSS_DEBUG is not set
# CONFIG_MAC80211_VERBOSE_PS_DEBUG is not set
CONFIG_MAC80211_DEBUG_COUNTERS=y
# CONFIG_MAC80211_VERBOSE_SPECT_MGMT_DEBUG is not set
CONFIG_WIMAX=m
CONFIG_WIMAX_DEBUG_LEVEL=8
# CONFIG_RFKILL is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_CONNECTOR is not set
CONFIG_PARPORT=y
# CONFIG_PARPORT_PC is not set
# CONFIG_PARPORT_GSC is not set
CONFIG_PARPORT_AX88796=y
CONFIG_PARPORT_1284=y
CONFIG_PARPORT_NOT_PC=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
CONFIG_BLK_CPQ_DA=y
# CONFIG_BLK_CPQ_CISS_DA is not set
CONFIG_BLK_DEV_DAC960=m
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
CONFIG_BLK_DEV_NBD=y
CONFIG_BLK_DEV_SX8=y
CONFIG_BLK_DEV_UB=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=y
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
CONFIG_ATA_OVER_ETH=y
CONFIG_VIRTIO_BLK=m
CONFIG_BLK_DEV_HD=y
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
CONFIG_PHANTOM=m
CONFIG_SGI_IOC4=m
CONFIG_TIFM_CORE=m
# CONFIG_TIFM_7XX1 is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
CONFIG_C2PORT=m
CONFIG_C2PORT_DURAMAR_2150=m

#
# EEPROM support
#
CONFIG_EEPROM_AT24=m
CONFIG_EEPROM_AT25=m
CONFIG_EEPROM_LEGACY=y
CONFIG_EEPROM_93CX6=y
CONFIG_HAVE_IDE=y

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=y
CONFIG_BLK_DEV_SR=m
# CONFIG_BLK_DEV_SR_VENDOR is not set
# CONFIG_CHR_DEV_SG is not set
CONFIG_CHR_DEV_SCH=m

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=y
# CONFIG_SCSI_SAS_ATTRS is not set
CONFIG_SCSI_SRP_ATTRS=m
# CONFIG_SCSI_SRP_TGT_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_SCSI_CXGB3_ISCSI=m
CONFIG_BLK_DEV_3W_XXXX_RAID=y
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_ACARD=y
CONFIG_SCSI_AACRAID=y
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=5000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
CONFIG_SCSI_AIC7XXX_OLD=y
CONFIG_SCSI_AIC79XX=y
CONFIG_AIC79XX_CMDS_PER_DEVICE=32
CONFIG_AIC79XX_RESET_DELAY_MS=5000
CONFIG_AIC79XX_DEBUG_ENABLE=y
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
# CONFIG_SCSI_DPT_I2O is not set
CONFIG_SCSI_ADVANSYS=y
CONFIG_SCSI_ARCMSR=y
CONFIG_SCSI_ARCMSR_AER=y
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
CONFIG_MEGARAID_LEGACY=y
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
CONFIG_LIBFC=y
CONFIG_FCOE=y
# CONFIG_SCSI_DMX3191D is not set
CONFIG_SCSI_EATA=m
CONFIG_SCSI_EATA_TAGGED_QUEUE=y
# CONFIG_SCSI_EATA_LINKED_COMMANDS is not set
CONFIG_SCSI_EATA_MAX_TAGS=16
CONFIG_SCSI_FUTURE_DOMAIN=m
CONFIG_SCSI_FD_MCS=y
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_IBMMCA=y
CONFIG_IBMMCA_SCSI_ORDER_STANDARD=y
CONFIG_IBMMCA_SCSI_DEV_RESET=y
# CONFIG_SCSI_IPS is not set
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
# CONFIG_SCSI_NCR_D700 is not set
CONFIG_SCSI_STEX=m
# CONFIG_SCSI_SYM53C8XX_2 is not set
CONFIG_SCSI_IPR=m
CONFIG_SCSI_IPR_TRACE=y
CONFIG_SCSI_IPR_DUMP=y
CONFIG_SCSI_NCR_Q720=y
CONFIG_SCSI_NCR53C8XX_DEFAULT_TAGS=8
CONFIG_SCSI_NCR53C8XX_MAX_TAGS=32
CONFIG_SCSI_NCR53C8XX_SYNC=20
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
# CONFIG_SCSI_QLA_ISCSI is not set
CONFIG_SCSI_LPFC=m
CONFIG_SCSI_LPFC_DEBUG_FS=y
CONFIG_SCSI_SIM710=y
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=y
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=m
# CONFIG_SCSI_DH_HP_SW is not set
CONFIG_SCSI_DH_EMC=m
CONFIG_SCSI_DH_ALUA=y
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y
CONFIG_SATA_SVW=y
CONFIG_ATA_PIIX=y
CONFIG_SATA_MV=m
CONFIG_SATA_NV=y
# CONFIG_PDC_ADMA is not set
CONFIG_SATA_QSTOR=y
CONFIG_SATA_PROMISE=y
CONFIG_SATA_SX4=y
CONFIG_SATA_SIL=y
# CONFIG_SATA_SIS is not set
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ALI is not set
CONFIG_PATA_AMD=y
# CONFIG_PATA_ARTOP is not set
CONFIG_PATA_ATIIXP=m
CONFIG_PATA_CMD640_PCI=m
CONFIG_PATA_CMD64X=y
CONFIG_PATA_CS5520=m
# CONFIG_PATA_CS5530 is not set
CONFIG_PATA_CS5535=y
# CONFIG_PATA_CS5536 is not set
CONFIG_PATA_CYPRESS=y
CONFIG_PATA_EFAR=y
# CONFIG_ATA_GENERIC is not set
CONFIG_PATA_HPT366=y
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
CONFIG_PATA_HPT3X3=y
CONFIG_PATA_HPT3X3_DMA=y
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
CONFIG_PATA_JMICRON=y
# CONFIG_PATA_TRIFLEX is not set
CONFIG_PATA_MARVELL=y
CONFIG_PATA_MPIIX=m
CONFIG_PATA_OLDPIIX=y
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
CONFIG_PATA_NS87410=m
CONFIG_PATA_NS87415=y
CONFIG_PATA_OPTI=y
# CONFIG_PATA_OPTIDMA is not set
CONFIG_PATA_PCMCIA=y
CONFIG_PATA_PDC_OLD=m
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
CONFIG_PATA_SC1200=y
CONFIG_PATA_SERVERWORKS=m
# CONFIG_PATA_PDC2027X is not set
CONFIG_PATA_SIL680=y
# CONFIG_PATA_SIS is not set
CONFIG_PATA_VIA=m
CONFIG_PATA_WINBOND=y
# CONFIG_PATA_PLATFORM is not set
CONFIG_PATA_SCH=y
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
CONFIG_MD_FAULTY=y
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=m
# CONFIG_DM_SNAPSHOT is not set
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
# CONFIG_DM_MULTIPATH is not set
CONFIG_DM_DELAY=m
CONFIG_DM_UEVENT=y
CONFIG_FUSION=y
CONFIG_FUSION_SPI=y
CONFIG_FUSION_FC=y
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=128
CONFIG_FUSION_CTL=y
# CONFIG_FUSION_LOGGING is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
CONFIG_FIREWIRE_OHCI_DEBUG=y
CONFIG_FIREWIRE_SBP2=m
# CONFIG_IEEE1394 is not set
CONFIG_I2O=m
CONFIG_I2O_LCT_NOTIFY_ON_CHANGES=y
# CONFIG_I2O_EXT_ADAPTEC is not set
# CONFIG_I2O_CONFIG is not set
CONFIG_I2O_BUS=m
CONFIG_I2O_BLOCK=m
# CONFIG_I2O_SCSI is not set
CONFIG_I2O_PROC=m
CONFIG_MACINTOSH_DRIVERS=y
# CONFIG_MAC_EMUMOUSEBTN is not set
CONFIG_NETDEVICES=y
CONFIG_IFB=m
CONFIG_DUMMY=m
CONFIG_BONDING=y
# CONFIG_MACVLAN is not set
CONFIG_EQUALIZER=m
CONFIG_TUN=y
CONFIG_VETH=y
CONFIG_ARCNET=y
# CONFIG_ARCNET_1201 is not set
# CONFIG_ARCNET_1051 is not set
CONFIG_ARCNET_RAW=m
# CONFIG_ARCNET_CAP is not set
CONFIG_ARCNET_COM90xx=y
# CONFIG_ARCNET_COM90xxIO is not set
# CONFIG_ARCNET_RIM_I is not set
# CONFIG_ARCNET_COM20020 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=y
CONFIG_DAVICOM_PHY=y
CONFIG_QSEMI_PHY=y
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
CONFIG_ICPLUS_PHY=y
CONFIG_REALTEK_PHY=m
# CONFIG_NATIONAL_PHY is not set
CONFIG_STE10XP=y
CONFIG_LSI_ET1011C_PHY=m
CONFIG_FIXED_PHY=y
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
CONFIG_HAPPYMEAL=y
CONFIG_SUNGEM=m
CONFIG_CASSINI=m
# CONFIG_NET_VENDOR_3COM is not set
CONFIG_VORTEX=y
CONFIG_NET_VENDOR_SMC=y
# CONFIG_ULTRAMCA is not set
CONFIG_ENC28J60=m
CONFIG_ENC28J60_WRITEVERIFY=y
# CONFIG_DNET is not set
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_TULIP=y
CONFIG_TULIP_MWI=y
# CONFIG_TULIP_MMIO is not set
CONFIG_TULIP_NAPI=y
CONFIG_TULIP_NAPI_HW_MITIGATION=y
CONFIG_DE4X5=y
CONFIG_WINBOND_840=y
CONFIG_DM9102=y
# CONFIG_ULI526X is not set
CONFIG_AT1700=y
CONFIG_DEPCA=y
CONFIG_HP100=y
CONFIG_NE2_MCA=y
CONFIG_IBMLANA=y
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=y
CONFIG_ADAPTEC_STARFIRE=m
# CONFIG_B44 is not set
CONFIG_FORCEDETH=y
CONFIG_FORCEDETH_NAPI=y
CONFIG_E100=y
CONFIG_FEALNX=m
CONFIG_NATSEMI=m
CONFIG_NE2K_PCI=m
CONFIG_8139CP=m
CONFIG_8139TOO=y
CONFIG_8139TOO_PIO=y
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R6040 is not set
CONFIG_SIS900=m
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
CONFIG_SUNDANCE=m
CONFIG_SUNDANCE_MMIO=y
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_SC92031 is not set
CONFIG_NET_POCKET=y
CONFIG_ATP=y
CONFIG_DE600=y
# CONFIG_DE620 is not set
CONFIG_ATL2=y
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=y
CONFIG_ACENIC_OMIT_TIGON_I=y
CONFIG_DL2K=y
CONFIG_E1000=m
CONFIG_E1000E=y
CONFIG_IP1000=y
CONFIG_IGB=m
CONFIG_IGB_LRO=y
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=m
CONFIG_R8169_VLAN=y
CONFIG_SIS190=y
CONFIG_SKGE=m
CONFIG_SKGE_DEBUG=y
CONFIG_SKY2=y
CONFIG_SKY2_DEBUG=y
CONFIG_VIA_VELOCITY=m
CONFIG_TIGON3=y
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
CONFIG_ATL1=m
CONFIG_ATL1E=m
# CONFIG_ATL1C is not set
CONFIG_JME=y
CONFIG_NETDEV_10000=y
CONFIG_CHELSIO_T1=y
CONFIG_CHELSIO_T1_1G=y
CONFIG_CHELSIO_T3_DEPENDS=y
CONFIG_CHELSIO_T3=y
# CONFIG_ENIC is not set
CONFIG_IXGBE=y
CONFIG_IXGB=y
CONFIG_S2IO=y
CONFIG_MYRI10GE=y
CONFIG_NIU=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
CONFIG_QLGE=m
CONFIG_SFC=m
CONFIG_BE2NET=m
CONFIG_TR=y
# CONFIG_IBMTR is not set
CONFIG_IBMOL=y
# CONFIG_IBMLS is not set
CONFIG_3C359=y
# CONFIG_TMS380TR is not set
# CONFIG_SMCTR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set
# CONFIG_IWLWIFI_LEDS is not set

#
# WiMAX Wireless Broadband devices
#
CONFIG_WIMAX_I2400M=m
CONFIG_WIMAX_I2400M_SDIO=m
CONFIG_WIMAX_I2400M_DEBUG_LEVEL=8

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
CONFIG_USB_KAWETH=y
# CONFIG_USB_PEGASUS is not set
CONFIG_USB_RTL8150=y
CONFIG_USB_USBNET=y
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=y
# CONFIG_USB_NET_DM9601 is not set
CONFIG_USB_NET_SMSC95XX=y
CONFIG_USB_NET_GL620A=m
# CONFIG_USB_NET_NET1080 is not set
CONFIG_USB_NET_PLUSB=y
# CONFIG_USB_NET_MCS7830 is not set
# CONFIG_USB_NET_RNDIS_HOST is not set
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
# CONFIG_USB_AN2720 is not set
CONFIG_USB_BELKIN=y
# CONFIG_USB_ARMLINUX is not set
# CONFIG_USB_EPSON2888 is not set
CONFIG_USB_KC2190=y
CONFIG_USB_NET_ZAURUS=y
CONFIG_NET_PCMCIA=y
CONFIG_PCMCIA_3C589=m
# CONFIG_PCMCIA_3C574 is not set
CONFIG_PCMCIA_FMVJ18X=m
CONFIG_PCMCIA_PCNET=y
CONFIG_PCMCIA_NMCLAN=y
# CONFIG_PCMCIA_SMC91C92 is not set
CONFIG_PCMCIA_XIRC2PS=m
# CONFIG_PCMCIA_AXNET is not set
CONFIG_PCMCIA_IBMTR=y
# CONFIG_WAN is not set
# CONFIG_ATM_DRIVERS is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
CONFIG_SKFP=y
CONFIG_HIPPI=y
CONFIG_ROADRUNNER=y
CONFIG_ROADRUNNER_LARGE_RINGS=y
CONFIG_PLIP=m
CONFIG_PPP=y
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
# CONFIG_PPP_ASYNC is not set
CONFIG_PPP_SYNC_TTY=y
# CONFIG_PPP_DEFLATE is not set
CONFIG_PPP_BSDCOMP=y
# CONFIG_PPP_MPPE is not set
CONFIG_PPPOE=m
# CONFIG_PPPOATM is not set
CONFIG_PPPOL2TP=m
# CONFIG_SLIP is not set
CONFIG_SLHC=y
# CONFIG_NET_FC is not set
CONFIG_NETCONSOLE=y
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_VIRTIO_NET=y
CONFIG_ISDN=y
# CONFIG_ISDN_I4L is not set
CONFIG_ISDN_CAPI=y
CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON=y
# CONFIG_CAPI_TRACE is not set
# CONFIG_ISDN_CAPI_MIDDLEWARE is not set
# CONFIG_ISDN_CAPI_CAPI20 is not set

#
# CAPI hardware drivers
#
CONFIG_CAPI_AVM=y
CONFIG_ISDN_DRV_AVMB1_B1PCI=y
CONFIG_ISDN_DRV_AVMB1_B1PCIV4=y
# CONFIG_ISDN_DRV_AVMB1_B1PCMCIA is not set
CONFIG_ISDN_DRV_AVMB1_T1PCI=m
# CONFIG_ISDN_DRV_AVMB1_C4 is not set
CONFIG_CAPI_EICON=y
CONFIG_ISDN_DIVAS=y
CONFIG_ISDN_DIVAS_BRIPCI=y
CONFIG_ISDN_DIVAS_PRIPCI=y
# CONFIG_ISDN_DIVAS_DIVACAPI is not set
CONFIG_ISDN_DIVAS_USERIDI=m
CONFIG_ISDN_DIVAS_MAINT=m
CONFIG_PHONE=m

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=m
CONFIG_INPUT_POLLDEV=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_KEYBOARD_SUNKBD=m
CONFIG_KEYBOARD_LKKBD=m
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_KEYBOARD_STOWAWAY=y
CONFIG_KEYBOARD_GPIO=m
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
# CONFIG_MOUSE_PS2_LIFEBOOK is not set
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
CONFIG_MOUSE_PS2_TOUCHKIT=y
CONFIG_MOUSE_SERIAL=y
CONFIG_MOUSE_APPLETOUCH=y
# CONFIG_MOUSE_BCM5974 is not set
CONFIG_MOUSE_VSXXXAA=y
# CONFIG_MOUSE_GPIO is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_WISTRON_BTNS is not set
CONFIG_INPUT_ATI_REMOTE=y
CONFIG_INPUT_ATI_REMOTE2=y
CONFIG_INPUT_KEYSPAN_REMOTE=y
CONFIG_INPUT_POWERMATE=y
CONFIG_INPUT_YEALINK=y
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=m
# CONFIG_SERIO_CT82C710 is not set
CONFIG_SERIO_PARKBD=m
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=y
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_COMPUTONE=y
# CONFIG_ROCKETPORT is not set
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
# CONFIG_DIGIEPCA is not set
CONFIG_MOXA_INTELLIO=m
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
# CONFIG_SYNCLINK is not set
CONFIG_SYNCLINKMP=y
# CONFIG_SYNCLINK_GT is not set
CONFIG_N_HDLC=m
# CONFIG_RISCOM8 is not set
CONFIG_SPECIALIX=y
CONFIG_SX=m
# CONFIG_RIO is not set
CONFIG_STALDRV=y
CONFIG_STALLION=y
# CONFIG_ISTALLION is not set
CONFIG_NOZOMI=m

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
# CONFIG_SERIAL_8250_PCI is not set
CONFIG_SERIAL_8250_CS=y
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_MCA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=y
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=y
CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=y
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_HW_RANDOM_GEODE=y
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_HW_RANDOM_VIRTIO=m
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
CONFIG_APPLICOM=m
CONFIG_SONYPI=m

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
CONFIG_CARDMAN_4000=m
# CONFIG_CARDMAN_4040 is not set
CONFIG_IPWIRELESS=m
CONFIG_MWAVE=y
CONFIG_SCx200_GPIO=m
CONFIG_PC8736x_GPIO=m
CONFIG_NSC_GPIO=m
CONFIG_CS5535_GPIO=y
# CONFIG_RAW_DRIVER is not set
# CONFIG_HANGCHECK_TIMER is not set
CONFIG_TCG_TPM=m
# CONFIG_TCG_NSC is not set
CONFIG_TCG_ATMEL=m
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=y
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
# CONFIG_I2C_AMD756 is not set
CONFIG_I2C_AMD8111=y
CONFIG_I2C_I801=y
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=m
# CONFIG_I2C_NFORCE2 is not set
CONFIG_I2C_SIS5595=m
CONFIG_I2C_SIS630=m
CONFIG_I2C_SIS96X=y
# CONFIG_I2C_VIA is not set
CONFIG_I2C_VIAPRO=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
CONFIG_I2C_GPIO=m
CONFIG_I2C_OCORES=y
CONFIG_I2C_SIMTEC=y

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_TAOS_EVM is not set
CONFIG_I2C_TINY_USB=m

#
# Graphics adapter I2C/DDC channel drivers
#
CONFIG_I2C_VOODOO3=y

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_PCA_PLATFORM=m
# CONFIG_I2C_STUB is not set
CONFIG_SCx200_I2C=m
CONFIG_SCx200_I2C_SCL=12
CONFIG_SCx200_I2C_SDA=13
# CONFIG_SCx200_ACB is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_PCA9539 is not set
CONFIG_SENSORS_PCF8591=y
CONFIG_SENSORS_MAX6875=y
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
CONFIG_I2C_DEBUG_ALGO=y
# CONFIG_I2C_DEBUG_BUS is not set
CONFIG_I2C_DEBUG_CHIP=y
CONFIG_SPI=y
CONFIG_SPI_MASTER=y

#
# SPI Master Controller Drivers
#
CONFIG_SPI_BITBANG=y
CONFIG_SPI_BUTTERFLY=y
# CONFIG_SPI_GPIO is not set
CONFIG_SPI_LM70_LLP=y

#
# SPI Protocol Masters
#
# CONFIG_SPI_SPIDEV is not set
CONFIG_SPI_TLE62X0=m
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
CONFIG_GPIO_SYSFS=y

#
# Memory mapped GPIO expanders:
#

#
# I2C GPIO expanders:
#
CONFIG_GPIO_MAX732X=y
# CONFIG_GPIO_PCA953X is not set
CONFIG_GPIO_PCF857X=y
# CONFIG_GPIO_TWL4030 is not set

#
# PCI GPIO expanders:
#
# CONFIG_GPIO_BT8XX is not set

#
# SPI GPIO expanders:
#
# CONFIG_GPIO_MAX7301 is not set
CONFIG_GPIO_MCP23S08=m
CONFIG_W1=m

#
# 1-wire Bus Masters
#
CONFIG_W1_MASTER_MATROX=m
CONFIG_W1_MASTER_DS2490=m
CONFIG_W1_MASTER_DS2482=m
CONFIG_W1_MASTER_GPIO=m

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
# CONFIG_W1_SLAVE_SMEM is not set
# CONFIG_W1_SLAVE_DS2431 is not set
# CONFIG_W1_SLAVE_DS2433 is not set
CONFIG_W1_SLAVE_DS2760=m
# CONFIG_W1_SLAVE_BQ27000 is not set
CONFIG_POWER_SUPPLY=y
CONFIG_POWER_SUPPLY_DEBUG=y
# CONFIG_PDA_POWER is not set
CONFIG_BATTERY_DS2760=m
CONFIG_BATTERY_BQ27x00=y
CONFIG_BATTERY_DA9030=m
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
CONFIG_SENSORS_ABITUGURU=m
CONFIG_SENSORS_ABITUGURU3=m
CONFIG_SENSORS_AD7414=m
# CONFIG_SENSORS_AD7418 is not set
CONFIG_SENSORS_ADCXX=m
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1029=m
CONFIG_SENSORS_ADM1031=m
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7462 is not set
CONFIG_SENSORS_ADT7470=m
CONFIG_SENSORS_ADT7473=m
CONFIG_SENSORS_ADT7475=m
CONFIG_SENSORS_K8TEMP=m
# CONFIG_SENSORS_ASB100 is not set
CONFIG_SENSORS_ATXP1=m
CONFIG_SENSORS_DS1621=m
CONFIG_SENSORS_I5K_AMB=m
CONFIG_SENSORS_F71805F=m
CONFIG_SENSORS_F71882FG=m
# CONFIG_SENSORS_F75375S is not set
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
# CONFIG_SENSORS_FSCHMD is not set
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM70=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LTC4245=m
CONFIG_SENSORS_MAX1111=m
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_PC87427=m
# CONFIG_SENSORS_SIS5595 is not set
CONFIG_SENSORS_DME1737=m
# CONFIG_SENSORS_SMSC47M1 is not set
CONFIG_SENSORS_SMSC47M192=m
CONFIG_SENSORS_SMSC47B397=m
CONFIG_SENSORS_ADS7828=m
# CONFIG_SENSORS_THMC50 is not set
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83791D=m
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83793=m
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83L786NG=m
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
CONFIG_SENSORS_HDAPS=m
CONFIG_SENSORS_APPLESMC=m
# CONFIG_HWMON_DEBUG_CHIP is not set
CONFIG_THERMAL=m
CONFIG_THERMAL_HWMON=y
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set

#
# Multifunction device drivers
#
CONFIG_MFD_CORE=y
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
CONFIG_TPS65010=y
CONFIG_TWL4030_CORE=y
# CONFIG_MFD_TMIO is not set
CONFIG_PMIC_DA903X=y
CONFIG_MFD_WM8400=y
# CONFIG_MFD_PCF50633 is not set
CONFIG_REGULATOR=y
CONFIG_REGULATOR_DEBUG=y
# CONFIG_REGULATOR_FIXED_VOLTAGE is not set
# CONFIG_REGULATOR_VIRTUAL_CONSUMER is not set
CONFIG_REGULATOR_BQ24022=m
CONFIG_REGULATOR_WM8400=y
# CONFIG_REGULATOR_DA903X is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
CONFIG_VIDEO_DEV=y
CONFIG_VIDEO_V4L2_COMMON=y
CONFIG_VIDEO_ALLOW_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
CONFIG_DVB_CORE=y
CONFIG_VIDEO_MEDIA=y

#
# Multimedia drivers
#
CONFIG_VIDEO_SAA7146=y
CONFIG_VIDEO_SAA7146_VV=y
CONFIG_MEDIA_ATTACH=y
CONFIG_MEDIA_TUNER=y
# CONFIG_MEDIA_TUNER_CUSTOMIZE is not set
CONFIG_MEDIA_TUNER_SIMPLE=y
CONFIG_MEDIA_TUNER_TDA8290=y
CONFIG_MEDIA_TUNER_TDA827X=m
CONFIG_MEDIA_TUNER_TDA18271=m
CONFIG_MEDIA_TUNER_TDA9887=y
CONFIG_MEDIA_TUNER_TEA5761=y
CONFIG_MEDIA_TUNER_TEA5767=y
CONFIG_MEDIA_TUNER_MT20XX=y
CONFIG_MEDIA_TUNER_MT2060=m
CONFIG_MEDIA_TUNER_MT2266=m
CONFIG_MEDIA_TUNER_QT1010=m
CONFIG_MEDIA_TUNER_XC2028=y
CONFIG_MEDIA_TUNER_XC5000=y
CONFIG_MEDIA_TUNER_MXL5005S=m
CONFIG_VIDEO_V4L2=y
CONFIG_VIDEO_V4L1=y
CONFIG_VIDEOBUF_GEN=y
CONFIG_VIDEOBUF_DMA_SG=y
# CONFIG_VIDEO_CAPTURE_DRIVERS is not set
CONFIG_RADIO_ADAPTERS=y
# CONFIG_RADIO_GEMTEK_PCI is not set
CONFIG_RADIO_MAXIRADIO=m
CONFIG_RADIO_MAESTRO=y
CONFIG_USB_DSBR=m
CONFIG_USB_SI470X=m
CONFIG_USB_MR800=y
# CONFIG_RADIO_TEA5764 is not set
CONFIG_DVB_DYNAMIC_MINORS=y
CONFIG_DVB_CAPTURE_DRIVERS=y

#
# Supported SAA7146 based PCI Adapters
#
CONFIG_TTPCI_EEPROM=y
CONFIG_DVB_AV7110=y
CONFIG_DVB_AV7110_OSD=y
CONFIG_DVB_BUDGET_CORE=y
# CONFIG_DVB_BUDGET is not set
# CONFIG_DVB_BUDGET_CI is not set
CONFIG_DVB_BUDGET_AV=y
CONFIG_DVB_BUDGET_PATCH=m

#
# Supported USB Adapters
#
CONFIG_DVB_USB=m
CONFIG_DVB_USB_DEBUG=y
CONFIG_DVB_USB_A800=m
CONFIG_DVB_USB_DIBUSB_MB=m
CONFIG_DVB_USB_DIBUSB_MB_FAULTY=y
CONFIG_DVB_USB_DIBUSB_MC=m
CONFIG_DVB_USB_DIB0700=m
CONFIG_DVB_USB_UMT_010=m
CONFIG_DVB_USB_CXUSB=m
CONFIG_DVB_USB_M920X=m
CONFIG_DVB_USB_GL861=m
# CONFIG_DVB_USB_AU6610 is not set
CONFIG_DVB_USB_DIGITV=m
# CONFIG_DVB_USB_VP7045 is not set
CONFIG_DVB_USB_VP702X=m
CONFIG_DVB_USB_GP8PSK=m
CONFIG_DVB_USB_NOVA_T_USB2=m
# CONFIG_DVB_USB_TTUSB2 is not set
# CONFIG_DVB_USB_DTT200U is not set
CONFIG_DVB_USB_OPERA1=m
# CONFIG_DVB_USB_DW2102 is not set
CONFIG_DVB_USB_CINERGY_T2=m
CONFIG_DVB_USB_ANYSEE=m
# CONFIG_DVB_USB_DTV5100 is not set
CONFIG_DVB_USB_AF9015=m
CONFIG_DVB_TTUSB_BUDGET=m
# CONFIG_DVB_TTUSB_DEC is not set
CONFIG_DVB_SIANO_SMS1XXX=m
CONFIG_DVB_SIANO_SMS1XXX_SMS_IDS=y

#
# Supported FlexCopII (B2C2) Adapters
#
CONFIG_DVB_B2C2_FLEXCOP=m
CONFIG_DVB_B2C2_FLEXCOP_PCI=m
CONFIG_DVB_B2C2_FLEXCOP_USB=m
# CONFIG_DVB_B2C2_FLEXCOP_DEBUG is not set

#
# Supported BT878 Adapters
#

#
# Supported Pluto2 Adapters
#
CONFIG_DVB_PLUTO2=y

#
# Supported SDMC DM1105 Adapters
#
CONFIG_DVB_DM1105=y

#
# Supported DVB Frontends
#

#
# Customise DVB Frontends
#
# CONFIG_DVB_FE_CUSTOMISE is not set

#
# Multistandard (satellite) frontends
#
CONFIG_DVB_STB0899=y
CONFIG_DVB_STB6100=m

#
# DVB-S (satellite) frontends
#
CONFIG_DVB_CX24110=m
CONFIG_DVB_CX24123=m
CONFIG_DVB_MT312=m
CONFIG_DVB_S5H1420=m
CONFIG_DVB_STV0288=y
CONFIG_DVB_STB6000=y
CONFIG_DVB_STV0299=y
CONFIG_DVB_TDA8083=y
# CONFIG_DVB_TDA10086 is not set
CONFIG_DVB_TDA8261=y
CONFIG_DVB_VES1X93=y
CONFIG_DVB_TUNER_ITD1000=m
CONFIG_DVB_TUNER_CX24113=m
CONFIG_DVB_TDA826X=y
CONFIG_DVB_TUA6100=y
CONFIG_DVB_CX24116=y
CONFIG_DVB_SI21XX=y

#
# DVB-T (terrestrial) frontends
#
CONFIG_DVB_SP8870=y
# CONFIG_DVB_SP887X is not set
CONFIG_DVB_CX22700=m
CONFIG_DVB_CX22702=m
CONFIG_DVB_DRX397XD=m
CONFIG_DVB_L64781=y
CONFIG_DVB_TDA1004X=y
CONFIG_DVB_NXT6000=y
CONFIG_DVB_MT352=m
CONFIG_DVB_ZL10353=m
CONFIG_DVB_DIB3000MB=y
CONFIG_DVB_DIB3000MC=y
CONFIG_DVB_DIB7000M=m
CONFIG_DVB_DIB7000P=y
CONFIG_DVB_TDA10048=m

#
# DVB-C (cable) frontends
#
CONFIG_DVB_VES1820=y
CONFIG_DVB_TDA10021=y
CONFIG_DVB_TDA10023=y
CONFIG_DVB_STV0297=y

#
# ATSC (North American/Korean Terrestrial/Cable DTV) frontends
#
CONFIG_DVB_NXT200X=m
CONFIG_DVB_OR51211=m
# CONFIG_DVB_OR51132 is not set
CONFIG_DVB_BCM3510=m
CONFIG_DVB_LGDT330X=m
CONFIG_DVB_LGDT3304=y
CONFIG_DVB_S5H1409=y
# CONFIG_DVB_AU8522 is not set
CONFIG_DVB_S5H1411=y

#
# ISDB-T (terrestrial) frontends
#
# CONFIG_DVB_S921 is not set

#
# Digital terrestrial only tuners/PLL
#
CONFIG_DVB_PLL=y
CONFIG_DVB_TUNER_DIB0070=m

#
# SEC control devices for DVB-S
#
CONFIG_DVB_LNBP21=y
CONFIG_DVB_ISL6405=m
CONFIG_DVB_ISL6421=y
CONFIG_DVB_LGS8GL5=m

#
# Tools to develop new frontends
#
CONFIG_DVB_DUMMY_FE=m
CONFIG_DVB_AF9013=y
CONFIG_DAB=y
CONFIG_USB_DABUSB=y

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_ALI=m
# CONFIG_AGP_ATI is not set
CONFIG_AGP_AMD=m
CONFIG_AGP_AMD64=m
# CONFIG_AGP_INTEL is not set
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
CONFIG_AGP_VIA=m
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
CONFIG_DRM_TDFX=y
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=y
# CONFIG_DRM_MGA is not set
CONFIG_DRM_SIS=m
CONFIG_DRM_VIA=y
CONFIG_DRM_SAVAGE=m
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=y
CONFIG_FB=m
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=m
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=m
CONFIG_FB_CFB_COPYAREA=m
CONFIG_FB_CFB_IMAGEBLIT=m
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=m
CONFIG_FB_SYS_COPYAREA=m
CONFIG_FB_SYS_IMAGEBLIT=m
CONFIG_FB_FOREIGN_ENDIAN=y
# CONFIG_FB_BOTH_ENDIAN is not set
CONFIG_FB_BIG_ENDIAN=y
# CONFIG_FB_LITTLE_ENDIAN is not set
CONFIG_FB_SYS_FOPS=m
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_HECUBA=m
CONFIG_FB_SVGALIB=m
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
CONFIG_FB_PM2=m
CONFIG_FB_PM2_FIFO_DISCONNECT=y
CONFIG_FB_CYBER2000=m
# CONFIG_FB_ARC is not set
CONFIG_FB_N411=m
# CONFIG_FB_HGA is not set
CONFIG_FB_S1D13XXX=m
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
CONFIG_FB_NVIDIA_DEBUG=y
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_FB_RIVA=m
CONFIG_FB_RIVA_I2C=y
# CONFIG_FB_RIVA_DEBUG is not set
CONFIG_FB_RIVA_BACKLIGHT=y
CONFIG_FB_LE80578=m
CONFIG_FB_CARILLO_RANCH=m
CONFIG_FB_MATROX=m
CONFIG_FB_MATROX_MILLENIUM=y
CONFIG_FB_MATROX_MYSTIQUE=y
CONFIG_FB_MATROX_G=y
CONFIG_FB_MATROX_I2C=m
CONFIG_FB_MATROX_MAVEN=m
CONFIG_FB_MATROX_MULTIHEAD=y
CONFIG_FB_ATY128=m
CONFIG_FB_ATY128_BACKLIGHT=y
CONFIG_FB_ATY=m
CONFIG_FB_ATY_CT=y
# CONFIG_FB_ATY_GENERIC_LCD is not set
CONFIG_FB_ATY_GX=y
# CONFIG_FB_ATY_BACKLIGHT is not set
# CONFIG_FB_S3 is not set
CONFIG_FB_SAVAGE=m
CONFIG_FB_SAVAGE_I2C=y
# CONFIG_FB_SAVAGE_ACCEL is not set
# CONFIG_FB_SIS is not set
CONFIG_FB_VIA=m
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
CONFIG_FB_VOODOO1=m
# CONFIG_FB_VT8623 is not set
CONFIG_FB_TRIDENT=m
CONFIG_FB_TRIDENT_ACCEL=y
CONFIG_FB_ARK=m
CONFIG_FB_PM3=m
# CONFIG_FB_CARMINE is not set
CONFIG_FB_GEODE=y
CONFIG_FB_GEODE_LX=m
CONFIG_FB_GEODE_GX=m
# CONFIG_FB_GEODE_GX1 is not set
# CONFIG_FB_TMIO is not set
CONFIG_FB_METRONOME=m
CONFIG_FB_MB862XX=m
CONFIG_FB_MB862XX_PCI_GDC=y
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_GENERIC=m
CONFIG_BACKLIGHT_PROGEAR=y
CONFIG_BACKLIGHT_DA903X=y
CONFIG_BACKLIGHT_MBP_NVIDIA=y
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
CONFIG_DISPLAY_SUPPORT=y

#
# Display hardware drivers
#

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
CONFIG_LOGO_LINUX_VGA16=y
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_HID_DEBUG=y
CONFIG_HIDRAW=y

#
# USB Input Devices
#
CONFIG_USB_HID=m
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# USB HID Boot Protocol drivers
#
CONFIG_USB_KBD=y
CONFIG_USB_MOUSE=y

#
# Special HID drivers
#
CONFIG_HID_COMPAT=y
CONFIG_HID_A4TECH=m
CONFIG_HID_APPLE=m
# CONFIG_HID_BELKIN is not set
CONFIG_HID_CHERRY=m
CONFIG_HID_CHICONY=m
CONFIG_HID_CYPRESS=m
# CONFIG_HID_EZKEY is not set
CONFIG_HID_GYRATION=m
# CONFIG_HID_LOGITECH is not set
CONFIG_HID_MICROSOFT=m
CONFIG_HID_MONTEREY=m
# CONFIG_HID_NTRIG is not set
CONFIG_HID_PANTHERLORD=m
CONFIG_PANTHERLORD_FF=y
CONFIG_HID_PETALYNX=m
CONFIG_HID_SAMSUNG=m
CONFIG_HID_SONY=m
CONFIG_HID_SUNPLUS=m
CONFIG_GREENASIA_FF=m
CONFIG_HID_TOPSEED=m
CONFIG_THRUSTMASTER_FF=m
# CONFIG_ZEROPLUS_FF is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
CONFIG_USB_DEBUG=y
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_DEVICE_CLASS is not set
CONFIG_USB_DYNAMIC_MINORS=y
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_WHITELIST is not set
CONFIG_USB_OTG_BLACKLIST_HUB=y
CONFIG_USB_MON=y
CONFIG_USB_WUSB=y
CONFIG_USB_WUSB_CBAF=m
CONFIG_USB_WUSB_CBAF_DEBUG=y

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_OXU210HP_HCD=y
CONFIG_USB_ISP116X_HCD=y
CONFIG_USB_ISP1760_HCD=m
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
CONFIG_USB_U132_HCD=m
# CONFIG_USB_SL811_HCD is not set
CONFIG_USB_R8A66597_HCD=y
CONFIG_USB_HWA_HCD=y

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
# CONFIG_USB_PRINTER is not set
CONFIG_USB_WDM=m
CONFIG_USB_TMC=m

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed;
#

#
# see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
CONFIG_USB_STORAGE_DEBUG=y
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
CONFIG_USB_STORAGE_ISD200=y
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
# CONFIG_USB_STORAGE_ALAUDA is not set
CONFIG_USB_STORAGE_ONETOUCH=y
# CONFIG_USB_STORAGE_KARMA is not set
CONFIG_USB_STORAGE_CYPRESS_ATACB=y
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
CONFIG_USB_MDC800=y
CONFIG_USB_MICROTEK=y

#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
CONFIG_USB_SERIAL=m
CONFIG_USB_EZUSB=y
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_AIRCABLE=m
CONFIG_USB_SERIAL_ARK3116=m
CONFIG_USB_SERIAL_BELKIN=m
CONFIG_USB_SERIAL_CH341=m
# CONFIG_USB_SERIAL_WHITEHEAT is not set
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_CP2101=m
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_FUNSOFT=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
# CONFIG_USB_SERIAL_IR is not set
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
CONFIG_USB_SERIAL_GARMIN=m
# CONFIG_USB_SERIAL_IPW is not set
CONFIG_USB_SERIAL_IUU=m
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
# CONFIG_USB_SERIAL_KLSI is not set
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_MOS7720=m
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MOTOROLA is not set
CONFIG_USB_SERIAL_NAVMAN=m
CONFIG_USB_SERIAL_PL2303=m
CONFIG_USB_SERIAL_OTI6858=m
# CONFIG_USB_SERIAL_SPCP8X5 is not set
CONFIG_USB_SERIAL_HP4X=m
CONFIG_USB_SERIAL_SAFE=m
# CONFIG_USB_SERIAL_SAFE_PADDED is not set
CONFIG_USB_SERIAL_SIEMENS_MPI=m
CONFIG_USB_SERIAL_SIERRAWIRELESS=m
CONFIG_USB_SERIAL_TI=m
# CONFIG_USB_SERIAL_CYBERJACK is not set
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_OPTION=m
CONFIG_USB_SERIAL_OMNINET=m
# CONFIG_USB_SERIAL_OPTICON is not set
CONFIG_USB_SERIAL_DEBUG=m

#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=y
CONFIG_USB_EMI26=m
# CONFIG_USB_ADUTUX is not set
CONFIG_USB_SEVSEG=m
CONFIG_USB_RIO500=m
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_BERRY_CHARGE=m
CONFIG_USB_LED=y
CONFIG_USB_CYPRESS_CY7C63=m
CONFIG_USB_CYTHERM=y
CONFIG_USB_PHIDGET=m
# CONFIG_USB_PHIDGETKIT is not set
CONFIG_USB_PHIDGETMOTORCONTROL=m
# CONFIG_USB_PHIDGETSERVO is not set
CONFIG_USB_IDMOUSE=y
CONFIG_USB_FTDI_ELAN=m
CONFIG_USB_APPLEDISPLAY=y
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
CONFIG_USB_TRANCEVIBRATOR=m
CONFIG_USB_IOWARRIOR=m
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
CONFIG_USB_VST=y
# CONFIG_USB_ATM is not set

#
# OTG and related infrastructure
#
CONFIG_USB_OTG_UTILS=y
CONFIG_USB_GPIO_VBUS=y
CONFIG_TWL4030_USB=y
CONFIG_UWB=y
CONFIG_UWB_HWA=y
# CONFIG_UWB_WHCI is not set
CONFIG_UWB_WLP=y
CONFIG_UWB_I1480U=m
# CONFIG_UWB_I1480U_WLP is not set
CONFIG_MMC=m
CONFIG_MMC_DEBUG=y
CONFIG_MMC_UNSAFE_RESUME=y

#
# MMC/SD/SDIO Card Drivers
#
# CONFIG_MMC_BLOCK is not set
# CONFIG_SDIO_UART is not set
# CONFIG_MMC_TEST is not set

#
# MMC/SD/SDIO Host Controller Drivers
#
# CONFIG_MMC_SDHCI is not set
CONFIG_MMC_WBSD=m
CONFIG_MMC_TIFM_SD=m
CONFIG_MMC_SDRICOH_CS=m
CONFIG_MEMSTICK=m
CONFIG_MEMSTICK_DEBUG=y

#
# MemoryStick drivers
#
CONFIG_MEMSTICK_UNSAFE_RESUME=y
CONFIG_MSPRO_BLOCK=m

#
# MemoryStick Host Controller Drivers
#
CONFIG_MEMSTICK_TIFM_MS=m
CONFIG_MEMSTICK_JMICRON_38X=m
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=m

#
# LED drivers
#
CONFIG_LEDS_NET48XX=m
CONFIG_LEDS_WRAP=m
CONFIG_LEDS_ALIX2=m
CONFIG_LEDS_PCA9532=m
# CONFIG_LEDS_GPIO is not set
CONFIG_LEDS_CLEVO_MAIL=m
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_DA903X is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
CONFIG_LEDS_TRIGGER_HEARTBEAT=y
CONFIG_LEDS_TRIGGER_BACKLIGHT=y
CONFIG_LEDS_TRIGGER_DEFAULT_ON=m
CONFIG_ACCESSIBILITY=y
CONFIG_A11Y_BRAILLE_CONSOLE=y
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m

#
# RTC interfaces
#
# CONFIG_RTC_INTF_SYSFS is not set
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
CONFIG_RTC_INTF_DEV_UIE_EMUL=y
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
# CONFIG_RTC_DRV_DS1374 is not set
CONFIG_RTC_DRV_DS1672=m
# CONFIG_RTC_DRV_MAX6900 is not set
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
# CONFIG_RTC_DRV_X1205 is not set
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_TWL4030 is not set
CONFIG_RTC_DRV_S35390A=m
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set

#
# SPI RTC drivers
#
# CONFIG_RTC_DRV_M41T94 is not set
CONFIG_RTC_DRV_DS1305=m
CONFIG_RTC_DRV_DS1390=m
CONFIG_RTC_DRV_MAX6902=m
CONFIG_RTC_DRV_R9701=m
CONFIG_RTC_DRV_RS5C348=m
# CONFIG_RTC_DRV_DS3234 is not set

#
# Platform RTC drivers
#
# CONFIG_RTC_DRV_CMOS is not set
# CONFIG_RTC_DRV_DS1286 is not set
CONFIG_RTC_DRV_DS1511=m
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_STK17TA8=m
CONFIG_RTC_DRV_M48T86=m
# CONFIG_RTC_DRV_M48T35 is not set
CONFIG_RTC_DRV_M48T59=m
CONFIG_RTC_DRV_BQ4802=m
CONFIG_RTC_DRV_V3020=m

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
CONFIG_AUXDISPLAY=y
CONFIG_UIO=y
# CONFIG_UIO_CIF is not set
# CONFIG_UIO_PDRV is not set
CONFIG_UIO_PDRV_GENIRQ=y
# CONFIG_UIO_SMX is not set
CONFIG_UIO_SERCOS3=m
CONFIG_X86_PLATFORM_DEVICES=y

#
# Firmware Drivers
#
CONFIG_EDD=m
# CONFIG_EDD_OFF is not set
# CONFIG_FIRMWARE_MEMMAP is not set
CONFIG_DELL_RBU=y
CONFIG_DCDBAS=y
CONFIG_DMIID=y
CONFIG_ISCSI_IBFT_FIND=y
CONFIG_ISCSI_IBFT=y

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=m
CONFIG_EXT4DEV_COMPAT=y
# CONFIG_EXT4_FS_XATTR is not set
CONFIG_FS_XIP=y
CONFIG_JBD=y
CONFIG_JBD_DEBUG=y
CONFIG_JBD2=m
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
CONFIG_REISERFS_CHECK=y
CONFIG_REISERFS_PROC_INFO=y
# CONFIG_REISERFS_FS_XATTR is not set
CONFIG_JFS_FS=y
CONFIG_JFS_POSIX_ACL=y
# CONFIG_JFS_SECURITY is not set
CONFIG_JFS_DEBUG=y
CONFIG_JFS_STATISTICS=y
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_XFS_DEBUG=y
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_DNOTIFY is not set
# CONFIG_INOTIFY is not set
# CONFIG_QUOTA is not set
CONFIG_QUOTACTL=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set
CONFIG_GENERIC_ACL=y

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
# CONFIG_MSDOS_FS is not set
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
# CONFIG_PROC_VMCORE is not set
# CONFIG_PROC_SYSCTL is not set
# CONFIG_PROC_PAGE_MONITOR is not set
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_MISC_FILESYSTEMS=y
CONFIG_ADFS_FS=y
CONFIG_ADFS_FS_RW=y
CONFIG_AFFS_FS=y
CONFIG_ECRYPT_FS=m
CONFIG_HFS_FS=m
CONFIG_HFSPLUS_FS=m
CONFIG_BEFS_FS=y
# CONFIG_BEFS_DEBUG is not set
CONFIG_BFS_FS=y
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
CONFIG_SQUASHFS=y
CONFIG_SQUASHFS_EMBEDDED=y
CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=3
CONFIG_VXFS_FS=y
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
CONFIG_HPFS_FS=y
CONFIG_QNX4FS_FS=m
CONFIG_ROMFS_FS=m
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
# CONFIG_NFS_V3 is not set
CONFIG_NFS_V4=y
# CONFIG_NFSD is not set
CONFIG_LOCKD=m
CONFIG_EXPORTFS=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_SUNRPC_REGISTER_V4=y
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_RPCSEC_GSS_SPKM3=m
# CONFIG_SMB_FS is not set
CONFIG_CIFS=y
CONFIG_CIFS_STATS=y
# CONFIG_CIFS_STATS2 is not set
# CONFIG_CIFS_WEAK_PW_HASH is not set
CONFIG_CIFS_UPCALL=y
CONFIG_CIFS_XATTR=y
# CONFIG_CIFS_POSIX is not set
CONFIG_CIFS_DEBUG2=y
CONFIG_CIFS_EXPERIMENTAL=y
CONFIG_CIFS_DFS_UPCALL=y
CONFIG_NCP_FS=m
# CONFIG_NCPFS_PACKET_SIGNING is not set
CONFIG_NCPFS_IOCTL_LOCKING=y
CONFIG_NCPFS_STRONG=y
CONFIG_NCPFS_NFS_NS=y
CONFIG_NCPFS_OS2_NS=y
# CONFIG_NCPFS_SMALLDOS is not set
CONFIG_NCPFS_NLS=y
CONFIG_NCPFS_EXTRAS=y
CONFIG_CODA_FS=m
CONFIG_AFS_FS=y
CONFIG_AFS_DEBUG=y

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
CONFIG_ACORN_PARTITION_CUMANA=y
# CONFIG_ACORN_PARTITION_EESOX is not set
CONFIG_ACORN_PARTITION_ICS=y
# CONFIG_ACORN_PARTITION_ADFS is not set
CONFIG_ACORN_PARTITION_POWERTEC=y
# CONFIG_ACORN_PARTITION_RISCIX is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
CONFIG_ULTRIX_PARTITION=y
CONFIG_SUN_PARTITION=y
# CONFIG_KARMA_PARTITION is not set
# CONFIG_EFI_PARTITION is not set
CONFIG_SYSV68_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
CONFIG_NLS_CODEPAGE_775=y
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
CONFIG_NLS_CODEPAGE_855=m
# CONFIG_NLS_CODEPAGE_857 is not set
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
CONFIG_NLS_CODEPAGE_936=m
# CONFIG_NLS_CODEPAGE_950 is not set
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=y
CONFIG_NLS_CODEPAGE_874=y
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=y
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=m
CONFIG_NLS_ISO8859_1=m
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
CONFIG_NLS_ISO8859_5=m
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=y
CONFIG_NLS_KOI8_U=y
CONFIG_NLS_UTF8=m
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ALLOW_WARNINGS is not set
CONFIG_FRAME_WARN=1024
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_SECTION_MISMATCH=y
# CONFIG_DEBUG_KERNEL is not set
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_STACKTRACE=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
CONFIG_RCU_CPU_STALL_DETECTOR=y
CONFIG_LATENCYTOP=y
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FTRACE_NMI_ENTER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_FTRACE_SYSCALLS=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_TRACING=y
CONFIG_TRACING_SUPPORT=y

#
# Tracers
#
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SYSPROF_TRACER=y
# CONFIG_SCHED_TRACER is not set
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_EVENT_TRACER=y
# CONFIG_FTRACE_SYSCALLS is not set
CONFIG_BOOT_TRACER=y
CONFIG_POWER_TRACER=y
# CONFIG_STACK_TRACER is not set
CONFIG_KMEMTRACE=y
# CONFIG_WORKQUEUE_TRACER is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
CONFIG_MMIOTRACE=y
CONFIG_MMIOTRACE_TEST=m
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
CONFIG_FIREWIRE_OHCI_REMOTE_DMA=y
# CONFIG_DYNAMIC_PRINTK_DEBUG is not set
CONFIG_DMA_API_DEBUG=y
CONFIG_SAMPLES=y
CONFIG_SAMPLE_MARKERS=m
CONFIG_SAMPLE_TRACEPOINTS=m
# CONFIG_SAMPLE_KOBJECT is not set
# CONFIG_SAMPLE_KPROBES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_STRICT_DEVMEM=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_4KSTACKS=y
CONFIG_DOUBLEFAULT=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
# CONFIG_IO_DELAY_0X80 is not set
CONFIG_IO_DELAY_0XED=y
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=1
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
# CONFIG_SECURITY_NETWORK is not set
CONFIG_SECURITY_PATH=y
CONFIG_SECURITY_FILE_CAPABILITIES=y
CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_FIPS=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=m
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=m
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
CONFIG_CRYPTO_SEQIV=m

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=m
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_LRW=y
CONFIG_CRYPTO_PCBC=y
CONFIG_CRYPTO_XTS=m

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=m

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=y
CONFIG_CRYPTO_MD4=y
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_RMD128=y
# CONFIG_CRYPTO_RMD160 is not set
CONFIG_CRYPTO_RMD256=y
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
# CONFIG_CRYPTO_TGR192 is not set
CONFIG_CRYPTO_WP512=y

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_586 is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_CAMELLIA=y
CONFIG_CRYPTO_CAST5=y
CONFIG_CRYPTO_CAST6=y
CONFIG_CRYPTO_DES=y
CONFIG_CRYPTO_FCRYPT=y
CONFIG_CRYPTO_KHAZAD=y
CONFIG_CRYPTO_SALSA20=m
CONFIG_CRYPTO_SALSA20_586=y
CONFIG_CRYPTO_SEED=m
# CONFIG_CRYPTO_SERPENT is not set
CONFIG_CRYPTO_TEA=y
# CONFIG_CRYPTO_TWOFISH is not set
CONFIG_CRYPTO_TWOFISH_COMMON=y
CONFIG_CRYPTO_TWOFISH_586=y

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_LZO=m

#
# Random Number Generation
#
CONFIG_CRYPTO_ANSI_CPRNG=m
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_PADLOCK=m
CONFIG_CRYPTO_DEV_PADLOCK_AES=m
CONFIG_CRYPTO_DEV_PADLOCK_SHA=m
# CONFIG_CRYPTO_DEV_GEODE is not set
CONFIG_CRYPTO_DEV_HIFN_795X=y
CONFIG_CRYPTO_DEV_HIFN_795X_RNG=y
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=m
CONFIG_KVM_AMD=y
# CONFIG_KVM_TRACE is not set
CONFIG_LGUEST=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_RING=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
CONFIG_CRC7=m
# CONFIG_LIBCRC32C is not set
CONFIG_AUDIT_GENERIC=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=m
CONFIG_LZO_DECOMPRESS=m
CONFIG_DECOMPRESS_GZIP=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_FORCE_SUCCESSFUL_BUILD=y
CONFIG_FORCE_MINIMAL_CONFIG=y
CONFIG_FORCE_MINIMAL_CONFIG_PHYS=y
CONFIG_X86_32_ALWAYS_ON=y


* Re: Linux 2.6.29
  2009-03-24 13:51                 ` Theodore Tso
@ 2009-03-24 16:34                   ` Jesper Krogh
  2009-03-24 17:32                     ` Linus Torvalds
  2009-03-24 18:20                   ` Mark Lord
  1 sibling, 1 reply; 664+ messages in thread
From: Jesper Krogh @ 2009-03-24 16:34 UTC (permalink / raw)
  To: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

Theodore Tso wrote:
> On Tue, Mar 24, 2009 at 02:30:11PM +0100, Ingo Molnar wrote:
>> i think the problem became visible via the rise in memory size, 
>> combined with the non-improvement of the performance of rotational 
>> disks.
>>
>> The disk speed versus RAM size ratio has become dramatically worse - 
>> and our "5% of RAM" dirty ratio on a 32 GB box is 1.6 GB - which 
>> takes an eternity to write out if you happen to sync on that. When 
>> we had 1 GB of RAM 5% meant 51 MB - one or two seconds to flush out 
>> - and worse than that, chances are that it's spread out widely on 
>> the disk, the whole thing becoming seek-limited as well.
> 
> That's definitely a problem too, but keep in mind that by default the
> journal gets committed every 5 seconds, so the data gets flushed out
> that often.  So the question is how quickly can you *dirty* 1.6GB of
> memory?

Say it's a file that you already have read into the memory cache.. there
is plenty of space in 16GB for that.. then you can dirty it at 
memory-speed.. that's about ½ sec. (correct me if I'm wrong).

Ok, this is probably unrealistic, but memory grows: the largest we have
at the moment is 32GB, and it's steadily growing with the core counts.
Then the available memory is used to cache the "active" portion of the
filesystems. I would even say that on the NFS servers I depend on it to
do this efficiently. (2.6.29-rc8 delivered 1050MB/s over a 10GbitE 
link using nfsd - send speed to multiple clients).

The current workload is based on an active dataset of 600GB where
indexes are being generated and written back to the same disk. So
there is a fairly high read/write load on the machine (as you said was 
required). The majority (perhaps 550GB) is only read once, whereas the
rest of the time it is stuff in the last 50GB being rewritten.

> "dd if=/dev/zero of=/u1/dirty-me-harder" will certainly do it, but
> normally we're doing something useful, and so you're either copying
> data from local disk, at which point you're limited by the read speed
> of your local disk (I suppose it could be in cache, but how common of
> a case is that?), 

Increasingly the case as memory sizes grow.

> *or*, you're copying from the network, and to copy
> in 1.6GB of data in 5 seconds, that means you're moving 320
> megabytes/second, which if we're copying in the data from the network,
> requires a 10 gigabit ethernet.

or the data is just being processed on the 16-32 cores in the system.
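
As a rough check of the numbers quoted above, here is a small sketch of
the arithmetic. It assumes binary units (GiB/MiB) and the 5% dirty ratio
and 5 second commit interval mentioned in the thread; the function names
are made up for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def dirty_limit_bytes(ram_bytes, ratio_percent=5):
    """Bytes allowed to be dirty for a given RAM size and dirty ratio."""
    return ram_bytes * ratio_percent // 100

def required_rate_mib_per_s(dirty_bytes, seconds=5):
    """Write rate needed to dirty `dirty_bytes` within one commit interval."""
    return dirty_bytes / MIB / seconds

for ram_gib in (1, 32):
    limit = dirty_limit_bytes(ram_gib * GIB)
    rate = required_rate_mib_per_s(limit)
    gbit = rate * MIB * 8 / 1e9  # same rate expressed as network bandwidth
    print(f"{ram_gib:>2} GiB RAM: dirty limit {limit / MIB:7.1f} MiB, "
          f"{rate:6.1f} MiB/s to fill it in 5 s (~{gbit:.2f} Gbit/s)")
```

For the 32 GB case this lands in the same range as the "320
megabytes/second" figure quoted above, which is more than a 1 Gbit link
can carry.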


Jesper
-- 
Jesper


* Re: Linux 2.6.29
  2009-03-24 16:34                   ` Jesper Krogh
@ 2009-03-24 17:32                     ` Linus Torvalds
  0 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-24 17:32 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Linux Kernel Mailing List



On Tue, 24 Mar 2009, Jesper Krogh wrote:
>
> Theodore Tso wrote:
> > That's definitely a problem too, but keep in mind that by default the
> > journal gets committed every 5 seconds, so the data gets flushed out
> > that often.  So the question is how quickly can you *dirty* 1.6GB of
> > memory?

Doesn't at least ext4 default to the _insane_ model of "data is less 
important than meta-data, and it doesn't get journalled"?

And ext3 with "data=writeback" does the same, no?

Both of which are - as far as I can tell - total braindamage. At least 
with ext3 it's not the _default_ mode.

I never understood how anybody doing filesystems (especially ones that 
claim to be crash-resistant due to journalling) would _ever_ accept the 
"writeback" behavior of having "clean fsck, but data loss".

> Say it's a file that you already have in memory cache read in.. there
> is plenty of space in 16GB for that.. then you can dirty it at memory-speed..
> that's about ½sec. (correct me if I'm wrong).

No, you'll still have to get per-page locks etc. If you use mmap(), you'll 
page-fault on each page, if you use write() you'll do all the page lookups 
etc. But yes, it can be pretty quick - the biggest cost probably _will_ be 
the speed of memory itself (doing one-byte writes at each block would 
change that, and the bottle-neck would become the system call and page 
lookup/locking path, but that's probably roughly the same cost as 
writing out one page).
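
To make the "one-byte writes at each block" case concrete, here is a
minimal sketch that dirties every page of a file through mmap() with a
single one-byte store per page. The helper name and the page count are
arbitrary choices for illustration; it says nothing about actual timings.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def dirty_every_page(path, pages):
    """Size the file to `pages` pages, then store one byte into each page."""
    with open(path, "wb") as f:
        f.truncate(pages * PAGE)          # file of zero-filled pages
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), pages * PAGE) as m:
            for i in range(pages):
                # The first store into a page takes a page fault and
                # marks the whole page dirty, not just the one byte.
                m[i * PAGE] = 0xFF
            m.flush()                     # msync(): push dirty pages out
    return pages * PAGE

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    size = dirty_every_page(path, pages=256)
    print(f"dirtied {size} bytes using 256 one-byte stores")
finally:
    os.unlink(path)
```

Only 256 bytes are stored, but 256 full pages end up needing writeback,
which is why the per-page fault and lookup path starts to matter in that
pattern.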

That said, this is all why we now have 'dirty_*bytes' limits too. 

The problem is that the dirty_[background_]bytes value really should be 
scaled up by the speed of IO. And we currently have no way to do that. 
Some machines can write a gigabyte in a second with some fancy RAID 
setups. Others will take minutes (or hours) to do that (crappy SSDs that 
get 25kB/s throughput on random writes).

The "dirty_[background_]ratio" percentage doesn't scale up by the speed of 
IO either, of course, but at least historically there was generally a 
pretty good correlation between amount of memory and speed of IO. The 
machines that had gigs and gigs of RAM tended to always have fast IO too.  
So scaling up dirty limits by memory size made sense both in the "we have 
tons of memory, so allow tons of it to be dirty" sense _and_ in the "we 
likely have a fast disk, so allow more pending dirty data".
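
A quick sketch of why a fixed byte limit cannot fit every machine: the
time to write out the same 1 GB of dirty data at different IO speeds.
The 1 GB/s and 25 kB/s figures are the ones from the message above; the
50 MB/s "single disk" figure is an assumption added for comparison.

```python
GB = 10 ** 9

def flush_seconds(dirty_bytes, write_bytes_per_s):
    """Seconds needed to write out `dirty_bytes` at a steady rate."""
    return dirty_bytes / write_bytes_per_s

devices = {
    "fancy RAID, 1 GB/s (from the message)": 1 * GB,
    "single disk, 50 MB/s (assumed)": 50 * 10 ** 6,
    "slow SSD random writes, 25 kB/s (from the message)": 25 * 10 ** 3,
}

for name, rate in devices.items():
    secs = flush_seconds(1 * GB, rate)
    print(f"{name}: {secs:,.0f} s ({secs / 3600:.2f} h)")
```

The same limit spans from about a second to many hours, which is the
scaling problem described above.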

				Linus


* Re: Linux 2.6.29
  2009-03-24 13:52               ` Alan Cox
  2009-03-24 14:28                 ` Theodore Tso
@ 2009-03-24 17:55                 ` Jan Kara
  1 sibling, 0 replies; 664+ messages in thread
From: Jan Kara @ 2009-03-24 17:55 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Tso, Ingo Molnar, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

> > They don't solve the problem where there is a *huge* amount of writes
> > going on, though --- if something is dirtying pages at a rate far
> 
> At very high rates other things seem to go pear shaped. I've not traced
> it back far enough to be sure but what I suspect occurs from the I/O at
> disk level is that two people are writing stuff out at once - presumably
> the vm paging pressure and the file system - as I see two streams of I/O
> that are each reasonably ordered but are interleaved.
  There are different problems leading to this:
1) JBD commit code writes ordered data on each transaction commit. This
is done in dirtied-time order which is not necessarily optimal in case
of random access IO. IO scheduler helps here though because we submit a
lot of IO at once. ext4 has at least the randomness part of this problem
"fixed" because it submits ordered data via writepages(). Doing this
change requires non-trivial changes to the journaling layer so I wasn't
brave enough to do it with ext3 and JBD as well (although porting the
patch is trivial).

2) When we do dirty throttling, there are going to be several threads
writing out on the filesystem (if you have more pdflush threads which
translates to having more than one CPU). Jens' per-BDI writeback
threads could help here (but I haven't yet got to reading his patches in
detail to be sure).

  These two problems together result in a non-optimal IO pattern. At least
that's where I got to when I was looking into why Berkeley DB is so
slow. I was trying to somehow serialize more pdflush threads on the
filesystem but a stupid solution does not really help much - either I
was starving some throttled thread by other threads doing writeback or
I didn't quite keep the disk busy. So something like Jens' approach
is probably the way to go in the end.
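
  A toy illustration of point 1) above, under heavy simplification: the
disk is modelled as a line of block numbers and cost as total head
movement. The block numbers are invented, and real IO schedulers and
disks are far more complicated than this.

```python
def total_seek_distance(blocks, start=0):
    """Sum of head movements when blocks are written in the given order."""
    pos, total = start, 0
    for block in blocks:
        total += abs(block - pos)
        pos = block
    return total

# Hypothetical block numbers, in the order the pages were dirtied.
dirtied_time_order = [900, 10, 850, 40, 700, 20, 990, 5]
# The same blocks, submitted in on-disk order instead.
by_block_number = sorted(dirtied_time_order)

print("dirtied-time order:", total_seek_distance(dirtied_time_order))
print("block-number order:", total_seek_distance(by_block_number))
```

With random access IO the dirtied-time order bounces across the disk,
while the sorted order makes a single pass.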

> > don't get *that* bad, even with ext3.  At least, I haven't found a
> > workload that doesn't involve either dd if=/dev/zero or a massive
> > amount of data coming in over the network that will cause fsync()
> > delays in the > 1-2 second category.  Ext3 has been around for a long
> 
> I see it with a desktop when it pages hard and also when doing heavy
> desktop I/O (in my case the repeatable every time case is saving large
> images in the gimp - A4 at 600-1200dpi).
> 
> The other one (#8636) seems to be a bug in the I/O schedulers as it goes
> away if you use a different I/O sched.
> 
> > solve.  Simply mounting an ext3 filesystem using ext4, without making
> > any change to the filesystem format, should solve the problem.
> 
> I will try this experiment but not with production data just yet 8)
> 
> > some other users' data files.  This was the reason for Stephen Tweedie
> > implementing the data=ordered mode, and making it the default.
> 
> Yes and in the server environment or for typical enterprise customers
> this is a *big issue*, especially the risk of it being undetected that
> they just inadvertently did something like put your medical data into the
> end of something public during a crash.
> 
> > Try ext4, I think you'll like it.  :-)
> 
> I need to, so that I can double check none of the open jbd locking bugs
> are there and close more bugzilla entries (#8147)
  This one is still there. I'll have a look at it tomorrow and hopefully
will be able to answer...

									Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: Linux 2.6.29
  2009-03-24 13:20             ` Theodore Tso
  2009-03-24 13:30               ` Ingo Molnar
  2009-03-24 13:52               ` Alan Cox
@ 2009-03-24 17:55               ` Linus Torvalds
  2009-03-24 18:41                 ` Kyle Moffett
  2009-03-24 18:45                 ` Theodore Tso
  2009-03-24 20:24               ` David Rees
  2009-03-24 23:03               ` Jesse Barnes
  4 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-24 17:55 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Tue, 24 Mar 2009, Theodore Tso wrote:
> 
> Try ext4, I think you'll like it.  :-)
> 
> Failing that, data=writeback for single-user machines is probably your
> best bet.

Isn't that the same fix? ext4 just defaults to the crappy "writeback" 
behavior, which is insane.

Sure, it makes things _much_ smoother, since now the actual data is no 
longer in the critical path for any journal writes, but anybody who thinks 
that's a solution is just incompetent. 

We might as well go back to ext2 then. If your data gets written out long 
after the metadata hit the disk, you are going to hit all kinds of bad 
issues if the machine ever goes down.

			Linus


* Re: Linux 2.6.29
  2009-03-24 13:51                 ` Theodore Tso
  2009-03-24 16:34                   ` Jesper Krogh
@ 2009-03-24 18:20                   ` Mark Lord
  2009-03-24 18:41                     ` Eric Sandeen
  1 sibling, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-24 18:20 UTC (permalink / raw)
  To: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

Theodore Tso wrote:
> So the question is how quickly can you *dirty* 1.6GB of memory?
..

MythTV:   rm /some/really/huge/video/file ; sync
          ## disk light stays on for several minutes..

Not quite the same thing, I suppose, but it does break
the shutdown scripts of every major Linux distribution.

Simple solution for MythTV is what people already do: use xfs instead.


* Re: Linux 2.6.29
  2009-03-24 18:20                   ` Mark Lord
@ 2009-03-24 18:41                     ` Eric Sandeen
  0 siblings, 0 replies; 664+ messages in thread
From: Eric Sandeen @ 2009-03-24 18:41 UTC (permalink / raw)
  To: Mark Lord
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

Mark Lord wrote:
> Theodore Tso wrote:
>> So the question is how quickly can you *dirty* 1.6GB of memory?
> ..
> 
> MythTV:   rm /some/really/huge/video/file ; sync
>           ## disk light stays on for several minutes..
> 
> Not quite the same thing, I suppose, but it does break
> the shutdown scripts of every major Linux distribution.

It is indeed a different issue.  ext3 does a fair bit of IO on a (here
60G file) delete:

http://people.redhat.com/~esandeen/rm_test/ext3_rm.png

ext4 is much better:

http://people.redhat.com/~esandeen/rm_test/ext4_rm.png

> Simple solution for MythTV is what people already do: use xfs instead.

and yes, xfs does it very quickly:

http://people.redhat.com/~esandeen/rm_test/xfs_rm.png

-Eric


* Re: Linux 2.6.29
  2009-03-24 17:55               ` Linus Torvalds
@ 2009-03-24 18:41                 ` Kyle Moffett
  2009-03-24 19:17                   ` Linus Torvalds
  2009-03-24 18:45                 ` Theodore Tso
  1 sibling, 1 reply; 664+ messages in thread
From: Kyle Moffett @ 2009-03-24 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 1:55 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, 24 Mar 2009, Theodore Tso wrote:
>> Try ext4, I think you'll like it.  :-)
>>
>> Failing that, data=writeback for single-user machines is probably your
>> best bet.
>
> Isn't that the same fix? ext4 just defaults to the crappy "writeback"
> behavior, which is insane.
>
> Sure, it makes things _much_ smoother, since now the actual data is no
> longer in the critical path for any journal writes, but anybody who thinks
> that's a solution is just incompetent.
>
> We might as well go back to ext2 then. If your data gets written out long
> after the metadata hit the disk, you are going to hit all kinds of bad
> issues if the machine ever goes down.

Not really...

Regardless of any journalling, a power-fail or a crash is almost
certainly going to cause "data loss" of some variety.  We simply
didn't get to sync everything we needed to (otherwise we'd all be
shutting down our computers with the SCRAM switches just for kicks).
The difference is, with ext3/4 (in any journal mode) we guarantee our
metadata is consistent.  This means that we won't double-allocate or
leak inodes or blocks, which means that we can safely *write* to the
filesystem as soon as we replay the journal.  With ext2 you *CAN'T* do
that at all, as somebody may have allocated an inode but not yet
marked it as in use.  The only way to safely figure all that out
without journalling is an fsck run.

The difference between ext4 and ext3-in-writeback-mode is this:  If
you get a crash in the narrow window *after* writing initial metadata
and before writing the data, ext4 will give you a zero length file,
whereas ext3-in-writeback-mode will give you a proper-length file
filled with whatever used to be on disk (might be the contents of a
previous /etc/shadow, or maybe somebody's finance files).

In that same situation, ext3 in data-ordered or data-journal mode will
"close" the window by preventing anybody else from making forward
progress until the data and the metadata are both updated.  The thing
is, even on ext3 I can get exactly the same kind of behavior with an
appropriately timed "kill -STOP $dumb_program", followed by a power
failure 60 seconds later.  It's a relatively obvious race condition...

When you create a file, you can't guarantee that all of that file's
data and metadata has hit disk until after an fsync() call returns.
The only *possible* exceptions are in cases like the
previously-mentioned (and now patched)
open(A)+write(A)+close(A)+rename(A,B), where the
rename-over-existing-file should act as an implicit filesystem
barrier.  It should ensure that all writes to the file get flushed
before it is renamed on top of an existing file, simply because so
much UNIX software expects it to act that way.

When you're dealing with programs that simply
open()+ftruncate()+write()+close(), however... there's always going to
be a window in-between the ftruncate and the write where the file *is*
an empty file, and in that case no amount of operating-system-level
cleverness can deal with application-level bugs.
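
For contrast with open()+ftruncate()+write()+close(), here is a minimal
sketch of the write-to-temporary, fsync(), rename-over-existing pattern
discussed above. The function name is made up, and the final directory
fsync is an extra precaution assumed here for Linux, not something the
thread asks for.

```python
import os
import tempfile

def replace_file_safely(path, data):
    """Replace `path` with `data` without ever exposing an empty file."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one
    # filesystem and is atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # data is on disk before the rename
        os.replace(tmp, path)         # rename(2): old or new, never empty
    except BaseException:
        os.unlink(tmp)
        raise
    # Flush the directory entry so the rename itself survives a crash.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "settings.conf")
replace_file_safely(target, b"version 1\n")
replace_file_safely(target, b"version 2\n")
with open(target, "rb") as f:
    print(f.read())
```

A reader of the file sees either the old contents or the new ones; the
truncate-then-write sequence has no such guarantee during its window.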

Cheers,
Kyle Moffett


* Re: Linux 2.6.29
  2009-03-24 17:55               ` Linus Torvalds
  2009-03-24 18:41                 ` Kyle Moffett
@ 2009-03-24 18:45                 ` Theodore Tso
  2009-03-24 19:21                   ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-24 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 10:55:40AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 24 Mar 2009, Theodore Tso wrote:
> > 
> > Try ext4, I think you'll like it.  :-)
> > 
> > Failing that, data=writeback for single-user machines is probably your
> > best bet.
> 
> Isn't that the same fix? ext4 just defaults to the crappy "writeback" 
> behavior, which is insane.

Technically, it's not data=writeback.  It's more like XFS's delayed
allocation; I've added workarounds so that files which are
replaced via truncate or rename get pushed out right away, which
should solve most of the problems involved with files becoming
zero-length after a system crash.

> Sure, it makes things _much_ smoother, since now the actual data is no 
> longer in the critical path for any journal writes, but anybody who thinks 
> that's a solution is just incompetent. 
> 
> We might as well go back to ext2 then. If your data gets written out long 
> after the metadata hit the disk, you are going to hit all kinds of bad 
> issues if the machine ever goes down.

With ext2 after a system crash you need to run fsck.  With ext4, fsck
isn't an issue, but if the application doesn't use fsync(), yes,
there's no guarantee (other than the workarounds for
replace-via-truncate and replace-via-rename), but there's plenty of
prior history that says that applications that care about data hitting
the disk should use fsync().  Otherwise, it will get spread out over a
few minutes; and for some files, that really won't make a difference.

For precious files, applications that use fsync() will be safe ---
otherwise, even with ext3, you can end up losing the contents of the
file if you crash right before the 5 second commit window.  At least back
in the days when people were proud of their Linux systems having 2-3
year uptimes, and where jiffies could actually wrap from time to time,
the difference between 5 seconds and 3 minutes really wasn't that big
of a deal.  People who really care about this can turn off delayed
allocation with the nodelalloc mount option.  Of course then they will
have the ext3 slower fsync() problem.
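
A minimal sketch of the "applications that care should use fsync()"
pattern for replace-via-rename, written as a userland example. The
function name replace_file_durably is hypothetical and the directory
fsync assumes a POSIX system; this is an illustration, not code from
the thread.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's userspace buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to write the data out
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        # Best-effort cleanup of the temporary file on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself reaches the disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp() creates the file with owner-only permissions, so a
real application would probably want to copy the mode of the file it
replaces.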

You are right that data=writeback and delayed allocation do both mean
that data can get pushed out much later than the metadata.  But that's
allowed by POSIX, and it does give some very nice performance
benefits.

With either data=writeback or delayed allocation, we can also adjust
the default commit interval and the writeback timer settings; if we
say, change the default commit interval to be 30 seconds, and change
the writeback expire interval to be 15 seconds, it will also smooth
out the writes significantly.  So that's yet another solution, with a
different set of tradeoffs.  

Depending on the set of applications someone is running on their
system, and the reliability of their hardware/power/system in
general, different tradeoffs will be more or less appropriate for the
system administrator in question.

							- Ted



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24  7:32     ` Jesper Krogh
  2009-03-24  8:16       ` Ingo Molnar
@ 2009-03-24 19:00       ` David Rees
  2009-03-25 17:42         ` Jesper Krogh
  2009-03-25 18:30         ` Theodore Tso
  1 sibling, 2 replies; 664+ messages in thread
From: David Rees @ 2009-03-24 19:00 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 12:32 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> David Rees wrote:
> The 480 seconds is not the "wait time" but the time gone before the
> message is printed. The kernel default was earlier 120 seconds but
> that was changed by Ingo Molnar back in September. I do get a lot less
> noise but it really doesn't tell anything about the nature of the problem.
>
> The system's spec:
> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
> Raid10 connected using 4Gbit fibre-channel. I'll leave it up to you to decide
> if that's fast or slow?

The drives should be fast enough to saturate 4Gbit FC in streaming
writes.  How fast is the array in practice?

> The strange thing is actually that the above process (updatedb.mlocate) is
> writing to / which is a device without any activity at all. All activity is
> on the Fibre Channel device above, but processes writing outside that seem to
> be affected as well.

Ah.  Sounds like your setup would benefit immensely from the per-bdi
patches from Jens Axboe.  I'm sure he would appreciate some feedback
from users like you on them.

>> What's your vm.dirty_background_ratio and
>>
>> vm.dirty_ratio set to?
>
> 2.6.29-rc8 defaults:
> jk@hest:/proc/sys/vm$ cat dirty_background_ratio
> 5
> jk@hest:/proc/sys/vm$ cat dirty_ratio
> 10

On a 32GB system that's 1.6GB of dirty data, but your array should be
able to write that out fairly quickly (in a couple seconds) as long as
it's not too random.  If it's spread all over the disk, write
throughput will drop significantly - how fast is data being written to
disk when your system suffers from large write latency?
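
The arithmetic behind those numbers, as a rough sketch. It treats all
of RAM as dirtyable, which is a simplifying assumption: the kernel
applies the ratios to dirtyable memory, which is somewhat less than
total RAM. The helper name dirty_threshold_bytes is made up for this
example.

```python
GIB = 1024 ** 3


def dirty_threshold_bytes(ram_bytes, ratio_percent):
    """Approximate dirty-page threshold, treating all RAM as dirtyable."""
    return ram_bytes * ratio_percent // 100


ram = 32 * GIB
background = dirty_threshold_bytes(ram, 5)   # vm.dirty_background_ratio
hard_limit = dirty_threshold_bytes(ram, 10)  # vm.dirty_ratio

print(f"background writeback starts near {background / GIB:.1f} GiB")
print(f"writers get throttled near {hard_limit / GIB:.1f} GiB")
```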

>>> Consensus seems to be something with large memory machines, lots of dirty
>>> pages and a long writeout time due to ext3.
>>
>> All filesystems seem to suffer from this issue to some degree.  I
>> posted to the list earlier trying to see if there was anything that
>> could be done to help my specific case.  I've got a system where if
>> someone starts writing out a large file, it kills client NFS writes.
>> Makes the system unusable:
>> http://marc.info/?l=linux-kernel&m=123732127919368&w=2
>
> Yes, I've hit 120s+ penalties just by saving a file in vim.

Yeah, your disks aren't keeping up and/or data isn't being written out
efficiently.

>> Only workaround I've found is to reduce dirty_background_ratio and
>> dirty_ratio to tiny levels.  Or throw good SSDs and/or a fast RAID
>> array at it so that large writes complete faster.  Have you tried the
>> new vm.dirty_bytes in 2.6.29?
>
> No.. What would you suggest to be a reasonable setting for that?

Look at whatever is there by default and try cutting them in half to start.

>> Everyone seems to agree that "autotuning" it is the way to go.  But no
>> one seems willing to step up and try to do it.  Probably because it's
>> hard to get right!
>
> I can test patches.. but I'm not a kernel-developer.. unfortunately.

Me either - but luckily there have been plenty chiming in on this thread now.

-Dave

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 18:41                 ` Kyle Moffett
@ 2009-03-24 19:17                   ` Linus Torvalds
  0 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-24 19:17 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Tue, 24 Mar 2009, Kyle Moffett wrote:
> 
> Regardless of any journalling, a power-fail or a crash is almost
> certainly going to cause "data loss" of some variety.

The point is, if you write your metadata earlier (say, every 5 sec) and 
the real data later (say, every 30 sec), you're actually MORE LIKELY to 
see corrupt files than if you try to write them together.

And if you write your data _first_, you're never going to see corruption 
at all.
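
A toy model of that argument, under stated assumptions: one file
update into newly allocated blocks, a single metadata commit time and
a single data writeout time, both in seconds after the write. The
function name and the 5/30 second figures are only illustrative (the
figures come from the paragraph above); it does not model any real
journal.

```python
def state_after_crash(crash_at, metadata_at, data_at):
    """Classify what a reader sees for one file update after a crash."""
    metadata_on_disk = metadata_at <= crash_at
    data_on_disk = data_at <= crash_at
    if metadata_on_disk and data_on_disk:
        return "new contents"
    if metadata_on_disk:
        # Metadata points at blocks whose data never arrived.
        return "corrupt"
    # Metadata still describes the old file, so the update is simply lost.
    return "old contents"


for label, metadata_at, data_at in [
    ("metadata first", 5, 30),
    ("data first", 30, 5),
]:
    outcomes = [state_after_crash(t, metadata_at, data_at) for t in range(40)]
    print(label, "-> corrupt in", outcomes.count("corrupt"), "of 40 crash times")
```

In this model the metadata-first ordering has a 25 second window in
which a crash leaves a corrupt file, while the data-first ordering has
none: a crash only ever loses the update.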

This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It 
literally does everything the wrong way around - writing data later than 
the metadata that points to it. Whoever came up with that solution was a 
moron. No ifs, buts, or maybes about it.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 16:02               ` Ingo Molnar
@ 2009-03-24 19:19                 ` Ingo Molnar
  2009-03-24 20:54                   ` Ingo Molnar
  2009-03-25  0:33                   ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Herbert Xu
  2009-03-25  0:32                 ` Herbert Xu
  1 sibling, 2 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 19:19 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> * Herbert Xu <herbert@gondor.apana.org.au> wrote:
> 
> > On Tue, Mar 24, 2009 at 04:47:17PM +0100, Ingo Molnar wrote:
> > > 
> > > test failure on one of the boxes, interface got stuck after ~100K 
> > > packets:
> > > 
> > > eth1      Link encap:Ethernet  HWaddr 00:13:D4:DC:41:12  
> > >           inet addr:10.0.1.13  Bcast:10.0.1.255  Mask:255.255.255.0
> > >           inet6 addr: fe80::213:d4ff:fedc:4112/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:22555 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:1897 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:1000 
> > >           RX bytes:2435071 (2.3 MiB)  TX bytes:503790 (491.9 KiB)
> > >           Interrupt:11 Base address:0x4000 
> > 
> > What's the NIC and config on this one? If it's still using the 
> > legacy/netif_rx path, where GRO is off by default, this patch 
> > should make it exactly the same as with my original patch 
> > reverted.
> 
> Same forcedeth box i reported before. Config below. (note: if you 
> want to use it you need to run it through 'make oldconfig', with 
> all defaults accepted)

Hm, i just had a test failure (hung interface) with this too.

I'll go back to the original straight revert of "303c6a0: gro: Fix 
legacy path napi_complete crash", and will test it overnight - to 
establish a baseline of stability again. (to make sure there are no 
other bugs interacting)

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 18:45                 ` Theodore Tso
@ 2009-03-24 19:21                   ` Linus Torvalds
  2009-03-24 19:40                     ` Ric Wheeler
  2009-03-24 19:55                     ` Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-24 19:21 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Tue, 24 Mar 2009, Theodore Tso wrote:
>
> With ext2 after a system crash you need to run fsck.  With ext4, fsck
> isn't an issue,

Bah. A corrupt filesystem is a corrupt filesystem. Whether you have to 
fsck it or not should be a secondary concern.

I personally find silent corruption to be _worse_ than the non-silent one. 
At least if there's some program that says "oops, your inode so-and-so 
seems to be scrogged" that's better than just silently having bad data in 
it.

Of course, never having bad data _nor_ needing fsck is clearly optimal. 
data=ordered gets pretty close (and data=journal is unacceptable for 
performance reasons).

But I really don't understand filesystem people who think that "fsck" is 
the important part, regardless of whether the data is valid or not. That's 
just stupid and _obviously_ bogus.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 19:21                   ` Linus Torvalds
@ 2009-03-24 19:40                     ` Ric Wheeler
  2009-03-24 19:55                     ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-24 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> On Tue, 24 Mar 2009, Theodore Tso wrote:
>   
>> With ext2 after a system crash you need to run fsck.  With ext4, fsck
>> isn't an issue,
>>     
>
> Bah. A corrupt filesystem is a corrupt filesystem. Whether you have to 
> fsck it or not should be a secondary concern.
>
> I personally find silent corruption to be _worse_ than the non-silent one. 
> At least if there's some program that says "oops, your inode so-and-so 
> seems to be scrogged" that's better than just silently having bad data in 
> it.
>
> Of course, never having bad data _nor_ needing fsck is clearly optimal. 
> data=ordered gets pretty close (and data=journal is unacceptable for 
> performance reasons).
>
> But I really don't understand filesystem people who think that "fsck" is 
> the important part, regardless of whether the data is valid or not. That's 
> just stupid and _obviously_ bogus.
>
> 			Linus
>   
It is always interesting to try to explain to users that just because 
fsck ran cleanly does not mean anything that they care about is actually 
safely on disk. The speed that fsck can run at is important when you are 
trying to recover data from a really hosed file system, but that is 
thankfully relatively rare for most people.

Having been involved in many calls with customers after crashes, what 
they really want to know is pretty routine - do you have all of the data 
I wrote? can you prove that it is the same data that I wrote? if not, 
what data is missing and needs to be restored?

We can help answer those questions with checksums or digital hashes 
to validate the actual user data of files (open question is when to 
compute it, where to store, would the SCSI T10 DIF/DIX stuff be 
sufficient), putting in place some background scrubbers to detect 
corruptions (which can happen even without an IO error), etc.
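
As a rough userland illustration of the checksum-plus-scrubber idea
(not the T10 DIF/DIX mechanism, and not anything proposed in this
thread): record a digest per file, then periodically re-read the files
and report the ones that no longer match. The names file_digest and
scrub, and the choice of SHA-256, are assumptions for the sketch.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scrub(manifest):
    """Return the paths whose contents no longer match the recorded digest.

    `manifest` maps path -> digest recorded when the file was known good.
    Files that cannot be read at all are reported as damaged too.
    """
    damaged = []
    for path, expected in manifest.items():
        try:
            if file_digest(path) != expected:
                damaged.append(path)
        except OSError:
            damaged.append(path)
    return damaged
```

The open questions above still apply: this sketch computes the digest
whenever the caller chooses and leaves storing the manifest to the
caller.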

Being able to pin point what was impacted is actually enormously useful 
- for example, being able to map a bad sector back into some meaningful 
object like a user file, meta-data (translation, run fsck) or so on. 

Ric




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 19:21                   ` Linus Torvalds
  2009-03-24 19:40                     ` Ric Wheeler
@ 2009-03-24 19:55                     ` Jeff Garzik
  2009-03-25  9:34                       ` Benny Halevy
  2009-03-25  9:39                       ` Jens Axboe
  1 sibling, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-24 19:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> But I really don't understand filesystem people who think that "fsck" is 
> the important part, regardless of whether the data is valid or not. That's 
> just stupid and _obviously_ bogus.

I think I can understand that point of view, at least:

More customers complain about hours-long fsck times than they do about 
silent data corruption of non-fsync'd files.


> The point is, if you write your metadata earlier (say, every 5 sec) and 
> the real data later (say, every 30 sec), you're actually MORE LIKELY to 
> see corrupt files than if you try to write them together.
> 
> And if you write your data _first_, you're never going to see corruption 
> at all.

Amen.

And, personal filesystem pet peeve:  please encourage proper FLUSH CACHE 
use to give users the data guarantees they deserve.  Linux's sync(2) and 
fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee 
a media write.

	Jeff


P.S.  Overall, I am thrilled that this ext3/ext4 transition and 
associated slashdotting has spurred debate over filesystem data 
guarantees.  This is the kind of discussion that has needed to happen 
for years, IMO.




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 13:20             ` Theodore Tso
                                 ` (2 preceding siblings ...)
  2009-03-24 17:55               ` Linus Torvalds
@ 2009-03-24 20:24               ` David Rees
  2009-03-25  7:30                 ` David Rees
  2009-03-24 23:03               ` Jesse Barnes
  4 siblings, 1 reply; 664+ messages in thread
From: David Rees @ 2009-03-24 20:24 UTC (permalink / raw)
  To: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 6:20 AM, Theodore Tso <tytso@mit.edu> wrote:
> However, what I've found, though, is that if you're just doing a local
> copy from one hard drive to another, or downloading a huge iso file
> from an ftp server over a wide area network, the fsync() delays really
> don't get *that* bad, even with ext3.  At least, I haven't found a
> workload that doesn't involve either dd if=/dev/zero or a massive
> amount of data coming in over the network that will cause fsync()
> delays in the > 1-2 second category.  Ext3 has been around for a long
> time, and it's only been the last couple of years that people have
> really complained about this; my theory is that it was the rise of >
> 10 megabit ethernets and the use of systems like distcc that really
> made this problem really become visible.  The only realistic workload
> I've found that triggers this requires a fast network dumping data to
> a local filesystem.

It's pretty easy to reproduce it these days.  Here's my setup, and
it's not even that fancy:  Dual core Xeon, 8GB RAM, SATA RAID1 array,
GigE network.  All it takes is a single client writing a large file
using Samba or NFS to introduce huge latencies.

Looking at the raw throughput, the server's disks can sustain
30-60MB/s writes (older disks), but the network can handle up to
~100MB/s.  Throw in some other random seeky IO on the server, a bunch
of fragmentation and its sustained write throughput in reality for
these writes is more like 10-25MB/s, far slower than the rate at which
a client can throw data at it.

5% dirty_ratio * 8GB is 400MB.  Let's say in reality the system is
flushing 20MB/s to disk, this is a delay of up to 20 seconds.  Let's
say you have a user application which needs to fsync a number of small
files (and unfortunately they are done serially) and now I've got
applications (like Firefox) which basically remain unresponsive the
entire time the write is being done.
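
The estimate above as a small calculation. It is a back-of-the-envelope
sketch that assumes the whole dirty backlog has to drain at a constant
rate before the fsync completes; the function name is made up for this
example.

```python
def worst_case_flush_seconds(ram_mb, dirty_percent, disk_mb_per_s):
    """Seconds needed to drain a full dirty backlog at a steady write rate."""
    dirty_mb = ram_mb * dirty_percent / 100
    return dirty_mb / disk_mb_per_s


# 8GB of RAM, 5% dirty, disks sustaining 20MB/s in practice.
print(worst_case_flush_seconds(8000, 5, 20))
# The same backlog at the 10MB/s and 25MB/s ends of the observed range.
print(worst_case_flush_seconds(8000, 5, 10))
print(worst_case_flush_seconds(8000, 5, 25))
```

With the 10-25MB/s range quoted above, the same 400MB backlog takes
somewhere between 16 and 40 seconds to drain.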

> (I'm sure someone will be ingenious enough to find something else
> though, and if they're interested, I've attached an fsync latency
> tester to this note.  If you find something; let me know, I'd be
> interested.)

Thanks - I'll give the program a shot later with my test case and see
what it reports.  My simple test case[1] for reproducing this has
reported 6-45 seconds depending on the system.  I'll try it with the
previously mentioned workload as well.
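
For anyone who wants a quick number without the attached tester, a
rough stand-in could look like the sketch below. It is not the program
attached to the earlier note and not the bugzilla test case; the
function name and the 4096-byte default are arbitrary choices.

```python
import os
import tempfile
import time


def fsync_latency(directory, size=4096):
    """Write `size` bytes to a new file in `directory` and time the fsync()."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"\0" * size)
        start = time.monotonic()
        os.fsync(fd)
        return time.monotonic() - start
    finally:
        # Always remove the scratch file, even if the write or fsync fails.
        os.close(fd)
        os.unlink(path)
```

Calling it once a second on the affected filesystem while a large
write is running on the same disk should show the delays discussed
here, if the system suffers from them.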

-Dave

[1] http://bugzilla.kernel.org/show_bug.cgi?id=12309#c249

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 19:19                 ` Ingo Molnar
@ 2009-03-24 20:54                   ` Ingo Molnar
  2009-03-24 21:17                     ` Revert "gro: Fix legacy path napi_complete crash", David Miller
  2009-03-25  0:33                   ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Herbert Xu
  1 sibling, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 20:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> > Same forcedeth box i reported before. Config below. (note: if 
> > you want to use it you need to run it through 'make oldconfig', 
> > with all defaults accepted)
> 
> Hm, i just had a test failure (hung interface) with this too.
> 
> I'll go back to the original straight revert of "303c6a0: gro: Fix 
> legacy path napi_complete crash", and will test it overnight - to 
> establish a baseline of stability again. (to make sure there are 
> no other bugs interacting)

FYI, this plain revert is holding up fine in my tests so far - 50 
random iterations - the previous one failed after 5 iterations.

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-24 20:54                   ` Ingo Molnar
@ 2009-03-24 21:17                     ` David Miller
  2009-03-24 22:01                       ` Ingo Molnar
  0 siblings, 1 reply; 664+ messages in thread
From: David Miller @ 2009-03-24 21:17 UTC (permalink / raw)
  To: mingo
  Cc: herbert, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 24 Mar 2009 21:54:44 +0100

> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > Same forcedeth box i reported before. Config below. (note: if 
> > > you want to use it you need to run it through 'make oldconfig', 
> > > with all defaults accepted)
> > 
> > Hm, i just had a test failure (hung interface) with this too.
> > 
> > I'll go back to the original straight revert of "303c6a0: gro: Fix 
> > legacy path napi_complete crash", and will test it overnight - to 
> > establish a baseline of stability again. (to make sure there are 
> > no other bugs interacting)
> 
> FYI, this plain revert is holding up fine in my tests so far - 50 
> random iterations - the previous one failed after 5 iterations.

Something must be up with respect to letting interrupts in during
certain windows of time, or similar.

I'll take a look at this and hopefully Herbert or myself will be
able to figure it out.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-24 15:09       ` Herbert Xu
  2009-03-24 15:29         ` Sascha Hauer
  2009-03-24 15:36         ` Ingo Molnar
@ 2009-03-24 21:36         ` David Miller
  2009-03-24 22:47           ` David Miller
  2009-03-25  0:23           ` Herbert Xu
  2 siblings, 2 replies; 664+ messages in thread
From: David Miller @ 2009-03-24 21:36 UTC (permalink / raw)
  To: herbert
  Cc: mingo, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 24 Mar 2009 23:09:28 +0800

> On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> >
> > Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
> 
> Actually, this patch is still racy.  If some interrupt comes in
> and we suddenly get the maximum amount of backlog we can still
> hang when we call __napi_complete incorrectly.  It's unlikely
> but we certainly shouldn't allow that.  Here's a better version.
> 
> net: Fix netpoll lockup in legacy receive path

Hmmm...

> @@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  		local_irq_disable();
>  		skb = __skb_dequeue(&queue->input_pkt_queue);
>  		if (!skb) {
> +			list_del(&napi->poll_list);
> +			clear_bit(NAPI_STATE_SCHED, &napi->state);
>  			local_irq_enable();
> -			napi_complete(napi);
> -			goto out;
> +			break;
>  		}
>  		local_irq_enable();

I think the problem is that we need to do the GRO flush before the
list delete and clearing the NAPI_STATE_SCHED bit.

You can't disown the NAPI context until you've squared away the GRO
state, I think.

Ingo's case stresses TCP a lot so I think he's hitting these GRO
cases a lot as well as hitting the backlog maximum.

So this mis-ordering of completion operations could explain why
he still sees problems.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-24 21:17                     ` Revert "gro: Fix legacy path napi_complete crash", David Miller
@ 2009-03-24 22:01                       ` Ingo Molnar
  0 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-24 22:01 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Tue, 24 Mar 2009 21:54:44 +0100
> 
> > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > > Same forcedeth box i reported before. Config below. (note: if 
> > > > you want to use it you need to run it through 'make oldconfig', 
> > > > with all defaults accepted)
> > > 
> > > Hm, i just had a test failure (hung interface) with this too.
> > > 
> > > I'll go back to the original straight revert of "303c6a0: gro: Fix 
> > > legacy path napi_complete crash", and will test it overnight - to 
> > > establish a baseline of stability again. (to make sure there are 
> > > no other bugs interacting)
> > 
> > FYI, this plain revert is holding up fine in my tests so far - 50 
> > random iterations - the previous one failed after 5 iterations.
> 
> Something must be up with respect to letting interrupts in during 
> certain windows of time, or similar.
> 
> I'll take a look at this and hopefully Herbert or myself will be 
> able to figure it out.

It definitely did not show usual patterns of bug behavior - i'd have 
found it yesterday morning if it did.

I spent most of the time trying to find a reliable reproducer 
.config and system. Sometimes the bug went away with a minor change 
in the .config. Until today i didn't even suspect a mainline change 
causing this.

Also, note that i have reduced the probability of UP kernels in my 
randconfigs artificially to about 12.5% (it is 50% upstream). Still, 
despite that measure, the 'best' .config i found was an UP config - 
i don't think that's an accident. Also, i had to fully saturate the 
target CPU over gigabit to hit the bug best.

Which suggests to me (empirically) that it's indeed a race and that 
it needs a saturated system with lots of IRQs to trigger, and 
perhaps that it needs saturated/overloaded network device queues and 
complex userspace/softirq/hardirq interactions.

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-24 21:36         ` David Miller
@ 2009-03-24 22:47           ` David Miller
  2009-03-25  0:24             ` Herbert Xu
  2009-03-25  0:23           ` Herbert Xu
  1 sibling, 1 reply; 664+ messages in thread
From: David Miller @ 2009-03-24 22:47 UTC (permalink / raw)
  To: herbert
  Cc: mingo, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

From: David Miller <davem@davemloft.net>
Date: Tue, 24 Mar 2009 14:36:22 -0700 (PDT)

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Tue, 24 Mar 2009 23:09:28 +0800
> 
> > @@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
> >  		local_irq_disable();
> >  		skb = __skb_dequeue(&queue->input_pkt_queue);
> >  		if (!skb) {
> > +			list_del(&napi->poll_list);
> > +			clear_bit(NAPI_STATE_SCHED, &napi->state);
> >  			local_irq_enable();
> > -			napi_complete(napi);
> > -			goto out;
> > +			break;
> >  		}
> >  		local_irq_enable();
> 
> I think the problem is that we need to do the GRO flush before the
> list delete and clearing the NAPI_STATE_SCHED bit.

Ok Herbert, I'm even more sure of this because in your original commit
log message you mention:

	This simply doesn't work since we need to flush the held
	GRO packets first.

We are certainly in a pickle here, actually.

We can't run the GRO flush until we re-enable interrupts.  But if we
re-enable interrupts, more packets get queued to the input_pkt_queue
and we end up back where we started.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 13:20             ` Theodore Tso
                                 ` (3 preceding siblings ...)
  2009-03-24 20:24               ` David Rees
@ 2009-03-24 23:03               ` Jesse Barnes
  2009-03-25  0:05                 ` Arjan van de Ven
                                   ` (2 more replies)
  4 siblings, 3 replies; 664+ messages in thread
From: Jesse Barnes @ 2009-03-24 23:03 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

On Tue, 24 Mar 2009 09:20:32 -0400
Theodore Tso <tytso@mit.edu> wrote:
> They don't solve the problem where there is a *huge* amount of writes
> going on, though --- if something is dirtying pages at a rate far
> greater than the local disk can write it out, say, either "dd
> if=/dev/zero of=/mnt/make-lots-of-writes" or a massive distcc cluster
> driving a huge amount of data towards a single system or a wget over a
> local 100 megabit ethernet from a massive NFS server where everything
> is in cache, then you can have a major delay with the fsync().

You make it sound like this is hard to do...  I was running into this
problem *every day* until I moved to XFS recently.  I'm running a
fairly beefy desktop (VMware running a crappy Windows install w/AV junk
on it, builds, icecream and large mailboxes) and have a lot of RAM, but
it became unusable for minutes at a time, which was just totally
unacceptable, thus the switch.  Things have been better since, but are
still a little choppy.

I remember early in the 2.6.x days there was a lot of focus on making
interactive performance good, and for a long time it was.  But this I/O
problem has been around for a *long* time now... What happened?  Do not
many people run into this daily?  Do all the filesystem hackers run
with special mount options to mitigate the problem?

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 23:03               ` Jesse Barnes
@ 2009-03-25  0:05                 ` Arjan van de Ven
  2009-03-25 17:59                   ` David Rees
  2009-03-25 18:40                   ` Stephen Clark
  2009-03-25  2:09                 ` Theodore Tso
  2009-03-27 11:27                 ` Martin Steigerwald
  2 siblings, 2 replies; 664+ messages in thread
From: Arjan van de Ven @ 2009-03-25  0:05 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

On Tue, 24 Mar 2009 16:03:53 -0700
Jesse Barnes <jbarnes@virtuousgeek.org> wrote:

> 
> I remember early in the 2.6.x days there was a lot of focus on making
> interactive performance good, and for a long time it was.  But this
> I/O problem has been around for a *long* time now... What happened?
> Do not many people run into this daily?  Do all the filesystem
> hackers run with special mount options to mitigate the problem?
> 

the people that care use my kernel patch on ext3 ;-)
(or the userland equivalent tweak in /etc/rc.local)



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-24 21:36         ` David Miller
  2009-03-24 22:47           ` David Miller
@ 2009-03-25  0:23           ` Herbert Xu
  2009-03-25  2:11             ` David Miller
  2009-03-25  9:34             ` Ingo Molnar
  1 sibling, 2 replies; 664+ messages in thread
From: Herbert Xu @ 2009-03-25  0:23 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

On Tue, Mar 24, 2009 at 02:36:22PM -0700, David Miller wrote:
>
> I think the problem is that we need to do the GRO flush before the
> list delete and clearing the NAPI_STATE_SCHED bit.

Well first of all GRO shouldn't even be on in Ingo's case, unless
he enabled it by hand with ethtool.  Secondly the only thing that
touches the GRO state for the legacy path is process_backlog, and
since this is per-cpu, I can't see how another instance can run
while the first is still going.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-24 22:47           ` David Miller
@ 2009-03-25  0:24             ` Herbert Xu
  0 siblings, 0 replies; 664+ messages in thread
From: Herbert Xu @ 2009-03-25  0:24 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

On Tue, Mar 24, 2009 at 03:47:43PM -0700, David Miller wrote:
>
> > I think the problem is that we need to do the GRO flush before the
> > list delete and clearing the NAPI_STATE_SCHED bit.
> 
> Ok Herbert, I'm even more sure of this because in your original commit
> log message you mention:
> 
> 	This simply doesn't work since we need to flush the held
> 	GRO packets first.

That's only because I was calling __napi_complete, which is used
by drivers in general so I added the check to ensure that GRO
packets have been flushed.  Now that we're open-coding it this is
no longer a requirement.

But what's more GRO should be off on Ingo's test machines because
we haven't added anything to turn it on by default for non-NAPI
drivers.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 16:02               ` Ingo Molnar
  2009-03-24 19:19                 ` Ingo Molnar
@ 2009-03-25  0:32                 ` Herbert Xu
  2009-03-25  2:09                   ` Revert "gro: Fix legacy path napi_complete crash", David Miller
  1 sibling, 1 reply; 664+ messages in thread
From: Herbert Xu @ 2009-03-25  0:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel

On Tue, Mar 24, 2009 at 05:02:41PM +0100, Ingo Molnar wrote:
> 
> * Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> > What's the NIC and config on this one? If it's still using the 
> > legacy/netif_rx path, where GRO is off by default, this patch 
> > should make it exactly the same as with my original patch 
> > reverted.
> 
> Same forcedeth box i reported before. Config below. (note: if you 
> want to use it you need to run it through 'make oldconfig', with all 
> defaults accepted)
>
> CONFIG_FORCEDETH=y
> CONFIG_FORCEDETH_NAPI=y

This means that we shouldn't even invoke netif_rx/process_backlog,
so something else is going on.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29)
  2009-03-24 19:19                 ` Ingo Molnar
  2009-03-24 20:54                   ` Ingo Molnar
@ 2009-03-25  0:33                   ` Herbert Xu
  1 sibling, 0 replies; 664+ messages in thread
From: Herbert Xu @ 2009-03-25  0:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	kernel

On Tue, Mar 24, 2009 at 08:19:00PM +0100, Ingo Molnar wrote:
>
> Hm, i just had a test failure (hung interface) with this too.

Was this with NAPI on or off?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25  0:32                 ` Herbert Xu
@ 2009-03-25  2:09                   ` David Miller
  0 siblings, 0 replies; 664+ messages in thread
From: David Miller @ 2009-03-25  2:09 UTC (permalink / raw)
  To: herbert
  Cc: mingo, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 25 Mar 2009 08:32:35 +0800

> On Tue, Mar 24, 2009 at 05:02:41PM +0100, Ingo Molnar wrote:
> > 
> > * Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > > What's the NIC and config on this one? If it's still using the 
> > > legacy/netif_rx path, where GRO is off by default, this patch 
> > > should make it exactly the same as with my original patch 
> > > reverted.
> > 
> > Same forcedeth box i reported before. Config below. (note: if you 
> > want to use it you need to run it through 'make oldconfig', with all 
> > defaults accepted)
> >
> > CONFIG_FORCEDETH=y
> > CONFIG_FORCEDETH_NAPI=y
> 
> This means that we shouldn't even invoke netif_rx/process_backlog,
> so something else is going on.

There is always loopback which does netif_rx().

Combine that with the straight NAPI receive that forcedeth
is doing here and I'm sure there are all kinds of race
scenarios possible :-)

You're right about GRO not being relevant here.  To be honest
I wouldn't be disappointed if GRO was simply on by default
even for the legacy paths.

* Re: Linux 2.6.29
  2009-03-24 23:03               ` Jesse Barnes
  2009-03-25  0:05                 ` Arjan van de Ven
@ 2009-03-25  2:09                 ` Theodore Tso
  2009-03-25  3:57                   ` Jesse Barnes
  2009-03-27 11:27                 ` Martin Steigerwald
  2 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-25  2:09 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 04:03:53PM -0700, Jesse Barnes wrote:
> 
> You make it sound like this is hard to do...  I was running into this
> problem *every day* until I moved to XFS recently.  I'm running a
> fairly beefy desktop (VMware running a crappy Windows install w/AV junk
> on it, builds, icecream and large mailboxes) and have a lot of RAM, but
> it became unusable for minutes at a time, which was just totally
> unacceptable, thus the switch.  Things have been better since, but are
> still a little choppy.
> 

I have 4 gigs of memory on my laptop, and I've never seen these
sorts of issues.  So maybe filesystem hackers don't have enough
memory; or we don't use the right workloads?  It would help if I
understood how to trigger these disaster cases.  I've had to work
*really* hard (as in dd if=/dev/zero of=/mnt/dirty-me-harder) in order
to get even a 30 second fsync() delay.  So understanding what sort of
things you do that cause that many files' data blocks to be dirtied,
and/or what is causing a major read workload, would be useful.

It may be that we just need to tune the VM to be much more aggressive
about pushing dirty pages to the disk sooner.  Understanding how the
dynamics are working would be the first step.

> I remember early in the 2.6.x days there was a lot of focus on making
> interactive performance good, and for a long time it was.  But this I/O
> problem has been around for a *long* time now... What happened?  Do not
> many people run into this daily?  Do all the filesystem hackers run
> with special mount options to mitigate the problem?

All I can tell you is that *I* don't run into them, even when I was
using ext3 and before I got an SSD in my laptop.  I don't understand
why; maybe because I don't get really nice toys like systems with
32G's of memory.  Or maybe it's because I don't use icecream (whatever
that is).  Whatever it is, it would be useful to get some solid
reproduction information, with details about hardware configuration,
and information collected using sar and scripts that gather
/proc/meminfo every 5 seconds, and what the applications were doing at
the time.
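
For anyone putting together such a script, here is a minimal sketch in C
of the /proc/meminfo sampling part.  It is only an illustration: the
function names (meminfo_kb, sample_meminfo) are made up for this example,
and it assumes the usual "Key:   value kB" layout of /proc/meminfo and
that the Dirty and Writeback counters are the ones worth watching for
this problem.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "Key:   value kB" line in /proc/meminfo-style text.
 * Returns the value in kB, or -1 if the key is not present.
 * (Hypothetical helper, not part of any existing tool.)
 */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen;
    const char *line = text;

    assert(text != NULL && key != NULL);
    klen = strlen(key);

    while (line && *line) {
        /* A match must start a line and be followed directly by ':'. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/*
 * Read /proc/meminfo once and print the Dirty and Writeback counters.
 * Returns 0 on success, -1 if the file could not be opened.
 */
int sample_meminfo(FILE *out)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    fprintf(out, "Dirty: %ld kB  Writeback: %ld kB\n",
            meminfo_kb(buf, "Dirty"), meminfo_kb(buf, "Writeback"));
    return 0;
}
```

Calling sample_meminfo(stdout) from a loop with sleep(5) between calls
gives the 5 second sampling interval mentioned above.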

It might also be useful for someone to try reducing the amount of
memory the system is using by using mem= on the boot line, and see if
that changes things, and to try simplifying the application workload,
and/or using iotop to determine what is most contributing to the
problem.  (And of course, this needs to be done with someone using
ext3, since both ext4 and XFS use delayed allocation, which will
largely make this problem go away.)

					- Ted

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25  0:23           ` Herbert Xu
@ 2009-03-25  2:11             ` David Miller
  2009-03-25  7:33               ` Ingo Molnar
  2009-03-25  9:34             ` Ingo Molnar
  1 sibling, 1 reply; 664+ messages in thread
From: David Miller @ 2009-03-25  2:11 UTC (permalink / raw)
  To: herbert
  Cc: mingo, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 25 Mar 2009 08:23:03 +0800

> On Tue, Mar 24, 2009 at 02:36:22PM -0700, David Miller wrote:
> >
> > I think the problem is that we need to do the GRO flush before the
> > list delete and clearing the NAPI_STATE_SCHED bit.
> 
> Well first of all GRO shouldn't even be on in Ingo's case, unless
> he enabled it by hand with ethtool.  Secondly the only thing that
> touches the GRO state for the legacy path is process_backlog, and
> since this is per-cpu, I can't see how another instance can run
> while the first is still going.

Right.

I think the conditions Ingo is running under are that both
loopback (using legacy paths) and his NAPI based device
(forcedeth) are processing a lot of packets at the same
time.

Another thing that seems to be critical is he can only trigger this on
UP, which means that we don't have the damn APIC potentially moving
the cpu target of the forcedeth interrupts around.  And this means
also that all the processing will be on one cpu's backlog queue only.

* Re: Linux 2.6.29
  2009-03-25  2:09                 ` Theodore Tso
@ 2009-03-25  3:57                   ` Jesse Barnes
  0 siblings, 0 replies; 664+ messages in thread
From: Jesse Barnes @ 2009-03-25  3:57 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

On Tue, 24 Mar 2009 22:09:15 -0400
Theodore Tso <tytso@mit.edu> wrote:

> On Tue, Mar 24, 2009 at 04:03:53PM -0700, Jesse Barnes wrote:
> > 
> > You make it sound like this is hard to do...  I was running into
> > this problem *every day* until I moved to XFS recently.  I'm
> > running a fairly beefy desktop (VMware running a crappy Windows
> > install w/AV junk on it, builds, icecream and large mailboxes) and
> > have a lot of RAM, but it became unusable for minutes at a time,
> > which was just totally unacceptable, thus the switch.  Things have
> > been better since, but are still a little choppy.
> > 
> 
> I have 4 gigs of memory on my laptop, and I've never seen these
> sorts of issues.  So maybe filesystem hackers don't have enough
> memory; or we don't use the right workloads?  It would help if I
> understood how to trigger these disaster cases.  I've had to work
> *really* hard (as in dd if=/dev/zero of=/mnt/dirty-me-harder) in order
> to get even a 30 second fsync() delay.  So understanding what sort of
> things you do that cause that many files data blocks to be dirtied,
> and/or what is causing a major read workload, would be useful.
> 
> It may be that we just need to tune the VM to be much more aggressive
> about pushing dirty pages to the disk sooner.  Understanding how the
> dynamics are working would be the first step.

Well I think that's part of the problem; this is bigger than just
filesystems; I've been using ext3 since before I started seeing this,
so it seems like a bad VM/fs interaction may be to blame.

> > I remember early in the 2.6.x days there was a lot of focus on
> > making interactive performance good, and for a long time it was.
> > But this I/O problem has been around for a *long* time now... What
> > happened?  Do not many people run into this daily?  Do all the
> > filesystem hackers run with special mount options to mitigate the
> > problem?
> 
> All I can tell you is that *I* don't run into them, even when I was
> using ext3 and before I got an SSD in my laptop.  I don't understand
> why; maybe because I don't get really nice toys like systems with
> 32G's of memory.  Or maybe it's because I don't use icecream (whatever
> that is).  Whatever it is, it would be useful to get some solid
> reproduction information, with details about hardware configuration,
> and information collected using sar and scripts that gather
> /proc/meminfo every 5 seconds, and what the applications were doing at
> the time.

icecream is a distributed compiler system.  Like distcc but a bit more
cross-compile & heterogeneous compiler friendly.

> It might also be useful for someone to try reducing the amount of
> memory the system is using by using mem= on the boot line, and see if
> that changes things, and to try simplifying the application workload,
> and/or using iotop to determine what is most contributing to the
> problem.  (And of course, this needs to be done with someone using
> ext3, since both ext4 and XFS use delayed allocation, which will
> largely make this problem go away.)

Yep, and that's where my blame comes in.  I whined about this to a few
people, like Arjan, who provided workarounds, but never got beyond
that.  Some real debugging would be needed to find & fix the root
cause(s).

-- 
Jesse Barnes, Intel Open Source Technology Center

* Re: Linux 2.6.29
  2009-03-24 20:24               ` David Rees
@ 2009-03-25  7:30                 ` David Rees
  0 siblings, 0 replies; 664+ messages in thread
From: David Rees @ 2009-03-25  7:30 UTC (permalink / raw)
  To: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 1:24 PM, David Rees <drees76@gmail.com> wrote:
> On Tue, Mar 24, 2009 at 6:20 AM, Theodore Tso <tytso@mit.edu> wrote:
>> The only realistic workload
>> I've found that triggers this requires a fast network dumping data to
>> a local filesystem.
>
> It's pretty easy to reproduce it these days.  Here's my setup, and
> it's not even that fancy:  Dual core Xeon, 8GB RAM, SATA RAID1 array,
> GigE network.  All it takes is a single client writing a large file
> using Samba or NFS to introduce huge latencies.
>
> Looking at the raw throughput, the server's disks can sustain
> 30-60MB/s writes (older disks), but the network can handle up to
> ~100MB/s.  Throw in some other random seeky IO on the server, a bunch
> of fragmentation and its sustained write throughput in reality for
> these writes is more like 10-25MB/s, far slower than the rate at which
> a client can throw data at it.
>
>> (I'm sure someone will be ingenious enough to find something else
>> though, and if they're interested, I've attached an fsync latency
>> tester to this note.  If you find something; let me know, I'd be
>> interested.)

OK, two simple tests on this system produce latencies well over 1-2s
using your fsync-tester.

The network client writing to disk scenario (~1GB file) resulted in this:
fsync time: 6.5272
fsync time: 35.6803
fsync time: 15.6488
fsync time: 0.3570

One thing to note - writing to this particular array seems to have
higher than expected latency without the big write, on the order of
0.2 seconds or so.  I think this is because the system is not idle and
has a good number of programs on it doing logging and other small bits
of IO. vmstat 5 shows the system writing out about 300-1000 under the
bo column.

Copying that file to a separate disk was not as bad, but there were
still some big spikes:

fsync time: 6.8808
fsync time: 18.4634
fsync time: 9.6852
fsync time: 10.6146
fsync time: 8.5015
fsync time: 5.2160

The destination disk did not have any significant IO on it at the time.
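
For reference, the kind of measurement behind the "fsync time:" lines
can be sketched as below.  This is not the fsync-tester attached earlier
in the thread, just a minimal approximation under stated assumptions:
the function names, the one-line record and the one second interval are
all hypothetical choices for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Append one small record to `path`, fsync() it, and return how long the
 * fsync() call alone took, in seconds.  Returns -1.0 on any error.
 */
double timed_fsync(const char *path)
{
    static const char record[] = "fsync latency probe\n";
    struct timespec t0, t1;
    double elapsed = -1.0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1.0;

    if (write(fd, record, sizeof(record) - 1) ==
        (ssize_t)(sizeof(record) - 1)) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd) == 0) {
            clock_gettime(CLOCK_MONOTONIC, &t1);
            elapsed = (double)(t1.tv_sec - t0.tv_sec) +
                      (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        }
    }
    close(fd);
    return elapsed;
}

/* Run `count` probes one second apart, printing each result. */
void run_probes(const char *path, int count)
{
    for (int i = 0; i < count; i++) {
        printf("fsync time: %.4f\n", timed_fsync(path));
        sleep(1);
    }
}
```

The probe file has to live on the filesystem under test, and the big
write has to be running at the same time for the numbers to mean much.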

The system is running Fedora 10 2.6.27.19-78.2.30.fc9.x86_64 and has
two RAID1 arrays attached to an aacraid controller. ext3 filesystems
mounted with noatime.

-Dave

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25  2:11             ` David Miller
@ 2009-03-25  7:33               ` Ingo Molnar
  2009-03-25  8:04                 ` David Miller
  2009-03-25 12:08                 ` Herbert Xu
  0 siblings, 2 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-25  7:33 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel


* David Miller <davem@davemloft.net> wrote:

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 25 Mar 2009 08:23:03 +0800
> 
> > On Tue, Mar 24, 2009 at 02:36:22PM -0700, David Miller wrote:
> > >
> > > I think the problem is that we need to do the GRO flush before the
> > > list delete and clearing the NAPI_STATE_SCHED bit.
> > 
> > Well first of all GRO shouldn't even be on in Ingo's case, unless
> > he enabled it by hand with ethtool.  Secondly the only thing that
> > touches the GRO state for the legacy path is process_backlog, and
> > since this is per-cpu, I can't see how another instance can run
> > while the first is still going.
> 
> Right.
> 
> I think the conditions Ingo is running under are that both loopback 
> (using legacy paths) and his NAPI based device (forcedeth) are 
> processing a lot of packets at the same time.
> 
> Another thing that seems to be critical is he can only trigger 
> this on UP, which means that we don't have the damn APIC 
> potentially moving the cpu target of the forcedeth interrupts 
> around.  And this means also that all the processing will be on 
> one cpu's backlog queue only.

I tested the plain revert i sent in the original report overnight 
(with about 12 hours of combined testing time), and all systems held 
up fine. The system that would reproduce the bug within 10-20 
iterations did 210 successful iterations. Other systems held up fine 
too.

So if there's no definitive resolution for the real cause of the 
bug, the plain revert looks like an acceptable interim choice for 
.29.1 - at least as far as my systems go.

	Ingo

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25  7:33               ` Ingo Molnar
@ 2009-03-25  8:04                 ` David Miller
  2009-03-25 12:08                 ` Herbert Xu
  1 sibling, 0 replies; 664+ messages in thread
From: David Miller @ 2009-03-25  8:04 UTC (permalink / raw)
  To: mingo
  Cc: herbert, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 25 Mar 2009 08:33:49 +0100

> So if there's no definitive resolution for the real cause of the 
> bug, the plain revert looks like an acceptable interim choice for 
> .29.1 - at least as far as my systems go.

Then we get back the instant OOPS that patch fixes :-)

I'm sure Herbert will look into fixing this properly.

* Re: Linux 2.6.29
  2009-03-24 19:55                     ` Jeff Garzik
@ 2009-03-25  9:34                       ` Benny Halevy
  2009-03-25  9:39                       ` Jens Axboe
  1 sibling, 0 replies; 664+ messages in thread
From: Benny Halevy @ 2009-03-25  9:34 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Mar. 24, 2009, 21:55 +0200, Jeff Garzik <jeff@garzik.org> wrote:
> Linus Torvalds wrote:
>> But I really don't understand filesystem people who think that "fsck" is 
>> the important part, regardless of whether the data is valid or not. That's 
>> just stupid and _obviously_ bogus.
> 
> I think I can understand that point of view, at least:
> 
> More customers complain about hours-long fsck times than they do about 
> silent data corruption of non-fsync'd files.
> 
> 
>> The point is, if you write your metadata earlier (say, every 5 sec) and 
>> the real data later (say, every 30 sec), you're actually MORE LIKELY to 
>> see corrupt files than if you try to write them together.
>>
>> And if you write your data _first_, you're never going to see corruption 
>> at all.
> 
> Amen.
> 
> And, personal filesystem pet peeve:  please encourage proper FLUSH CACHE 
> use to give users the data guarantees they deserve.  Linux's sync(2) and 
> fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee 
> a media write.

I completely agree.  This also applies to nfsd_sync, by the way.
What's the right place to implement that?
How about sync_blockdev?

Benny

> 
> 	Jeff
> 
> 
> P.S.  Overall, I am thrilled that this ext3/ext4 transition and 
> associated slashdotting has spurred debate over filesystem data 
> guarantees.  This is the kind of discussion that has needed to happen 
> for years, IMO.


-- 
Benny Halevy
Software Architect
Panasas, Inc.
bhalevy@panasas.com
Tel/Fax: +972-3-647-8340
Mobile: +972-54-802-8340

Panasas: The Leader in Parallel Storage
www.panasas.com

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25  0:23           ` Herbert Xu
  2009-03-25  2:11             ` David Miller
@ 2009-03-25  9:34             ` Ingo Molnar
  1 sibling, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-25  9:34 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel


* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Tue, Mar 24, 2009 at 02:36:22PM -0700, David Miller wrote:
> >
> > I think the problem is that we need to do the GRO flush before the
> > list delete and clearing the NAPI_STATE_SCHED bit.
> 
> Well first of all GRO shouldn't even be on in Ingo's case, unless
> he enabled it by hand with ethtool. [...]

i didn't. (But it's randconfig - so please have a good look at all 
.config details - maybe something has an unexpected side-effect?)

	Ingo

* Re: Linux 2.6.29
  2009-03-24 19:55                     ` Jeff Garzik
  2009-03-25  9:34                       ` Benny Halevy
@ 2009-03-25  9:39                       ` Jens Axboe
  2009-03-25 19:32                         ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-25  9:39 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Tue, Mar 24 2009, Jeff Garzik wrote:
> Linus Torvalds wrote:
>> But I really don't understand filesystem people who think that "fsck" 
>> is the important part, regardless of whether the data is valid or not. 
>> That's just stupid and _obviously_ bogus.
>
> I think I can understand that point of view, at least:
>
> More customers complain about hours-long fsck times than they do about  
> silent data corruption of non-fsync'd files.
>
>
>> The point is, if you write your metadata earlier (say, every 5 sec) and 
>> the real data later (say, every 30 sec), you're actually MORE LIKELY to 
>> see corrupt files than if you try to write them together.
>>
>> And if you write your data _first_, you're never going to see 
>> corruption at all.
>
> Amen.
>
> And, personal filesystem pet peeve:  please encourage proper FLUSH CACHE  
> use to give users the data guarantees they deserve.  Linux's sync(2) and  
> fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee  
> a media write.

fsync already does that, at least if you have barriers enabled on your
drive.

-- 
Jens Axboe


* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25  7:33               ` Ingo Molnar
  2009-03-25  8:04                 ` David Miller
@ 2009-03-25 12:08                 ` Herbert Xu
  2009-03-25 12:20                   ` Ingo Molnar
  2009-03-26  7:59                   ` David Miller
  1 sibling, 2 replies; 664+ messages in thread
From: Herbert Xu @ 2009-03-25 12:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

On Wed, Mar 25, 2009 at 08:33:49AM +0100, Ingo Molnar wrote:
> 
> So if there's no definitive resolution for the real cause of the 
> bug, the plain revert looks like an acceptable interim choice for 
> .29.1 - at least as far as my systems go.

OK, let's just do the revert and disable GRO for the legacy path.
This should be the safest option for 2.6.29.

GRO: Disable GRO on legacy netif_rx path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete.  Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.

Since we can't seem to find a fix that works properly right now,
this patch reverts all the GRO support from the netif_rx path.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..e438f54 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,18 +2588,15 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			__napi_complete(napi);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
-		napi_gro_receive(napi, skb);
+		netif_receive_skb(skb);
 	} while (++work < quota && jiffies == start_time);
 
-	napi_gro_flush(napi);
-
-out:
 	return work;
 }
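
The window described in the commit message can be illustrated with a
small user-space model.  It is a sketch, not kernel code: struct backlog
and the toy_* functions are invented here, and the model assumes that
netif_rx() only tries to schedule the backlog NAPI when the input queue
is empty, which is how the 2.6.29 code appears to behave.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the per-CPU backlog: a packet count plus the SCHED bit. */
struct backlog {
    int  qlen;       /* packets waiting in the input queue */
    bool scheduled;  /* models NAPI_STATE_SCHED / being on the poll list */
};

/*
 * Models netif_rx() from interrupt context.  Scheduling is attempted only
 * when the queue was empty; if SCHED is already set that attempt changes
 * nothing, so the bit simply ends up (or stays) set.
 */
void toy_netif_rx(struct backlog *b)
{
    if (b->qlen == 0)
        b->scheduled = true;
    b->qlen++;
}

/*
 * Models the "queue is empty" exit of process_backlog().
 *   complete_inside: true  -> complete first, then enable IRQs
 *                    false -> enable IRQs first, then complete
 *   irq_in_window:   an interrupt delivers one packet at the earliest
 *                    point where IRQs are enabled again
 */
void toy_poll_empty(struct backlog *b, bool complete_inside,
                    bool irq_in_window)
{
    assert(b->qlen == 0 && b->scheduled);

    if (complete_inside) {
        b->scheduled = false;      /* completion with IRQs still off */
        if (irq_in_window)
            toy_netif_rx(b);       /* IRQ arrives after IRQs are enabled */
    } else {
        if (irq_in_window)
            toy_netif_rx(b);       /* IRQ arrives before completion */
        b->scheduled = false;      /* completion clears SCHED regardless */
    }
}

/* The backlog is stuck when packets are queued but nothing will poll them. */
bool toy_stuck(const struct backlog *b)
{
    return b->qlen > 0 && !b->scheduled;
}
```

With complete_inside set to false and a packet arriving in the window,
toy_stuck() becomes true and stays true for every later toy_netif_rx()
call, since a non-empty queue never triggers another schedule attempt.
In the model that is the "fill up the backlog and lock up" case.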
 
Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 12:08                 ` Herbert Xu
@ 2009-03-25 12:20                   ` Ingo Molnar
  2009-03-25 12:26                     ` Herbert Xu
  2009-03-26  7:59                   ` David Miller
  1 sibling, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-03-25 12:20 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel


* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Wed, Mar 25, 2009 at 08:33:49AM +0100, Ingo Molnar wrote:
> > 
> > So if there's no definitive resolution for the real cause of the 
> > bug, the plain revert looks like an acceptable interim choice for 
> > .29.1 - at least as far as my systems go.
> 
> OK, let's just do the revert and disable GRO for the legacy path.
> This should be the safest option for 2.6.29.

ok - i have started testing the delta below, on top of the plain 
revert.

	Ingo

diff --git a/net/core/dev.c b/net/core/dev.c
index c1e9dc0..e438f54 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2594,11 +2594,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		}
 		local_irq_enable();
 
-		napi_gro_receive(napi, skb);
+		netif_receive_skb(skb);
 	} while (++work < quota && jiffies == start_time);
 
-	napi_gro_flush(napi);
-
 	return work;
 }
 

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 12:20                   ` Ingo Molnar
@ 2009-03-25 12:26                     ` Herbert Xu
  2009-03-25 22:01                       ` Ingo Molnar
  2009-03-25 22:54                       ` Jarek Poplawski
  0 siblings, 2 replies; 664+ messages in thread
From: Herbert Xu @ 2009-03-25 12:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

On Wed, Mar 25, 2009 at 01:20:46PM +0100, Ingo Molnar wrote:
> 
> ok - i have started testing the delta below, on top of the plain 
> revert.

Thanks! BTW Ingo, any chance you could help us identify the problem
with the previous patch? I don't have a forcedeth machine here
and the hang you had with my patch that open-coded __napi_complete
appears intimately connected to forcedeth (with NAPI enabled).

The simplest thing to try would be to build forcedeth.c with DEBUG
and see what it prints out after it locks up.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: Linux 2.6.29
  2009-03-24 11:12             ` Andrew Morton
  2009-03-24 12:23               ` Alan Cox
  2009-03-24 13:37               ` Theodore Tso
@ 2009-03-25 12:37               ` Jan Kara
  2009-03-25 15:00                 ` Theodore Tso
  2 siblings, 1 reply; 664+ messages in thread
From: Jan Kara @ 2009-03-25 12:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, Theodore Tso, Jens Axboe, David Rees, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

On Tue 24-03-09 04:12:49, Andrew Morton wrote:
> On Tue, 24 Mar 2009 11:31:11 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> > The thing is ... this is a _bad_ ext3 design bug affecting ext3 
> > users in the last decade or so of ext3 existence. Why is this issue 
> > not handled with the utmost priority and why wasn't it fixed 5 
> > years ago already? :-)
> > 
> > It does not matter whether we have extents or htrees when there are 
> > _trivially reproducible_ basic usability problems with ext3.
> > 
> 
> It's all there in that Oct 2008 thread.
> 
> The proposed tweak to kjournald is a bad fix - partly because it will
> elevate the priority of vast amounts of IO whose priority we don't _want_
> elevated.
> 
> But mainly because the problem lies elsewhere - in an area of contention
> between the committing and running transactions which we knowingly and
> reluctantly added to fix a bug in 
> 
> commit 773fc4c63442fbd8237b4805627f6906143204a8
> Author:     akpm <akpm>
> AuthorDate: Sun May 19 23:23:01 2002 +0000
> Commit:     akpm <akpm>
> CommitDate: Sun May 19 23:23:01 2002 +0000
> 
>     [PATCH] fix ext3 buffer-stealing
>     
>     Patch from sct fixes a long-standing (I did it!) and rather complex
>     problem with ext3.
>     
>     The problem is to do with buffers which are continually being dirtied
>     by an external agent.  I had code in there (for easily-triggerable
>     livelock avoidance) which steals the buffer from checkpoint mode and
>     reattaches it to the running transaction.  This violates ext3 ordering
>     requirements - it can permit journal space to be reclaimed before the
>     relevant data has really been written out.
>     
>     Also, we do have to reliably get a lock on the buffer when moving it
>     between lists and inspecting its internal state.  Otherwise a competing
>     read from the underlying block device can trigger an assertion failure,
>     and a competing write to the underlying block device can confuse ext3
>     journalling state completely.
  I've looked at this a bit. I suppose you mean the contention arising from
us taking the buffer lock in do_get_write_access()? But it's not obvious
to me why we'd be contending there... We call this function only for
metadata buffers (unless in data=journal mode) so there isn't a huge number
of these blocks. This buffer should be locked for a longer time only when
we do writeout for checkpoint (hmm, maybe you meant this one?). In
particular, note that we don't take the buffer lock when committing this
block to journal - we lock only the BJ_IO buffer. But in this case we wait
when the buffer is on BJ_Shadow list later so there is some contention in
this case.
  Also when I emailed with a few people about these sync problems, they
wrote that switching to data=writeback mode helps considerably so this
would indicate that handling of ordered mode data buffers is causing most
of the slowdown...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 12:37               ` Jan Kara
@ 2009-03-25 15:00                 ` Theodore Tso
  2009-03-25 17:29                   ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-25 15:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 01:37:44PM +0100, Jan Kara wrote:
> >     Also, we do have to reliably get a lock on the buffer when moving it
> >     between lists and inspecting its internal state.  Otherwise a competing
> >     read from the underlying block device can trigger an assertion failure,
> >     and a competing write to the underlying block device can confuse ext3
> >     journalling state completely.
>
>   I've looked at this a bit. I suppose you mean the contention arising from
> us taking the buffer lock in do_get_write_access()? But it's not obvious
> to me why we'd be contending there... We call this function only for
> metadata buffers (unless in data=journal mode) so there isn't huge amount
> of these blocks.

There isn't a huge number of those blocks, but if inode #1220 was
modified in the previous transaction which is now being committed, and
we then need to modify and write out inode #1221 in the current
transaction, and they share the same inode table block, that would
cause the contention.  That probably doesn't happen that often in a
synchronous code path, but it probably happens more often than you're
thinking.  I still think the fsync() problem is the much bigger deal,
and solving the contention problem isn't going to solve the fsync()
latency problem with ext3 data=ordered mode.
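
To make the inode #1220 / #1221 example concrete, here is a minimal
sketch of how consecutive inode numbers map onto inode table blocks in
an ext2/ext3-style layout.  The parameters (8192 inodes per group,
128-byte inodes, 4096-byte blocks) are assumed defaults for
illustration, not values taken from any system in this thread, and the
function name is hypothetical.

```python
def inode_table_location(ino, inodes_per_group=8192, inode_size=128,
                         block_size=4096):
    """Return (block_group, block_index_within_inode_table) for inode `ino`.

    Inode numbers start at 1, so the zero-based position is ino - 1.
    All geometry parameters are illustrative assumptions.
    """
    group = (ino - 1) // inodes_per_group
    index_in_group = (ino - 1) % inodes_per_group
    inodes_per_block = block_size // inode_size  # 32 with the defaults
    return group, index_in_group // inodes_per_block


a = inode_table_location(1220)
b = inode_table_location(1221)
print(a, b, "same block" if a == b else "different blocks")
```

With these assumed parameters each inode table block holds 32 inodes,
so the two inodes land in the same block and a commit of one can hold
up a modification of the other.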

>   Also when I emailed with a few people about these sync problems, they
> wrote that switching to data=writeback mode helps considerably so this
> would indicate that handling of ordered mode data buffers is causing most
> of the slowdown...

Yes, but we need to be clear whether this was an fsync() problem or
some other random delay problem.  If it's the fsync() problem,
obviously data=writeback will solve the fsync() latency delay problem.
(As will using delayed allocation in ext4 or XFS.)

    	       	       		     	  - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 15:00                 ` Theodore Tso
@ 2009-03-25 17:29                   ` Linus Torvalds
  2009-03-25 17:57                     ` Alan Cox
                                       ` (2 more replies)
  0 siblings, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 17:29 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Theodore Tso wrote:
>
> I still think the fsync() problem is the much bigger deal, and solving 
> the contention problem isn't going to solve the fsync() latency problem 
> with ext3 data=ordered mode.

The fsync() problem is really annoying, but what is doubly annoying is 
that sometimes one process doing fsync() (or sync) seems to cause other 
processes to hiccup too.

Now, I personally solved that problem by moving to (good) SSD's on my 
desktop, and I think that's indeed the long-term solution. But it would be 
good to try to figure out a solution in the short term for people who 
don't have new hardware thrown at them from random companies too.

I suspect it's a combination of filesystem transaction locking, together 
with the VM wanting to write out some unrelated blocks or inodes due to 
the system just being close to the dirty limits. Which is why the 
system-wide hiccups then happen especially when writing big files.

The VM _tries_ to do writes in the background, but if the writepage() path 
hits a filesystem-level blocking lock, that background write suddenly 
becomes largely synchronous.

I suspect there is also some possibility of confusion with inter-file 
(false) metadata dependencies. If a filesystem were to think that the file 
size is metadata that should be journaled (in a single journal), and the 
journaling code then decides that it needs to do those meta-data updates 
in the correct order (ie the big file write _before_ the file write that 
wants to be fsync'ed), then the fsync() will be delayed by a totally 
irrelevant large file having to have its data written out (due to 
data=ordered or whatever).

I'd like to think that no filesystem designer would ever be that silly, 
but I'm too scared to try to actually go and check. Because I could well 
imagine that somebody really thought that "size" is metadata.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 19:00       ` David Rees
@ 2009-03-25 17:42         ` Jesper Krogh
  2009-03-25 18:16           ` David Rees
  2009-03-25 18:30         ` Theodore Tso
  1 sibling, 1 reply; 664+ messages in thread
From: Jesper Krogh @ 2009-03-25 17:42 UTC (permalink / raw)
  To: David Rees; +Cc: Linus Torvalds, Linux Kernel Mailing List

David Rees wrote:
> On Tue, Mar 24, 2009 at 12:32 AM, Jesper Krogh <jesper@krogh.cc> wrote:
>> David Rees wrote:
>> The 480 secondes is not the "wait time" but the time gone before the
>> message is printed. It the kernel-default it was earlier 120 seconds but
>> thats changed by Ingo Molnar back in september. I do get a lot of less
>> noise but it really doesn't tell anything about the nature of the problem.
>>
>> The systes spec:
>> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
>> Raid10 connected using 4Gbit fibre-channel. I'll let it up to you to decide
>> if thats fast or slow?
> 
> The drives should be fast enough to saturate 4Gbit FC in streaming
> writes.  How fast is the array in practice?

That's always a good question. This host is far from being the only user
of the array at the time of testing (there are 4 FC channels connected
to a switch). Creating a fresh slice and just dd'ing onto it from
/dev/zero gives:
jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 78.0557 s, 134 MB/s
jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 8.11019 s, 129 MB/s

Watching with dstat while dd'ing, it peaks at 220M/s.

If I watch the dstat output in production, it peaks at around the same
(130MB/s), but the average is in the 90-100 MB/s range.

It has 2GB of battery-backed cache. I'm fairly sure that when it was new
(and I had only connected one host) I could get it up to around 350MB/s.

>> The strange thing is actually that the above process (updatedb.mlocate) is
>> writing to / which is a device without any activity at all. All activity is
>> on the Fibre Channel device above, but process writing outsid that seems to
>> be effected as well.
> 
> Ah.  Sounds like your setup would benefit immensely from the per-bdi
> patches from Jens Axboe.  I'm sure he would appreciate some feedback
> from users like you on them.
> 
>>> What's your vm.dirty_background_ratio and
>>>
>>> vm.dirty_ratio set to?
>> 2.6.29-rc8 defaults:
>> jk@hest:/proc/sys/vm$ cat dirty_background_ratio
>> 5
>> jk@hest:/proc/sys/vm$ cat dirty_ratio
>> 10
> 
> On a 32GB system that's 1.6GB of dirty data, but your array should be
> able to write that out fairly quickly (in a couple seconds) as long as
> it's not too random.  If it's spread all over the disk, write
> throughput will drop significantly - how fast is data being written to
> disk when your system suffers from large write latency?

That's another thing. I haven't been debugging while hitting it (yet), but
if I go in and do a sync on the system manually, it doesn't get above
50MB/s in writeout (measured using dstat). But even that doesn't add up
to 8 minutes: 1.6GB at 50MB/s => 32 s.
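
The arithmetic above can be checked with a short sketch.  It assumes
decimal units (1 GB = 10^9 bytes) and applies the ratio to total RAM;
the kernel actually applies it to free plus reclaimable memory, so
treat the result as a rough upper bound.  The function names are
hypothetical.

```python
GB = 10 ** 9
MB = 10 ** 6


def dirty_limit_bytes(mem_bytes, ratio_percent):
    """Dirty-data limit implied by a percentage ratio of memory."""
    return mem_bytes * ratio_percent // 100


def drain_seconds(dirty_bytes, rate_bytes_per_s):
    """Time to write out `dirty_bytes` at a steady writeout rate."""
    return dirty_bytes / rate_bytes_per_s


for ratio in (5, 10):  # dirty_background_ratio and dirty_ratio quoted above
    limit = dirty_limit_bytes(32 * GB, ratio)
    print(ratio, limit / GB, drain_seconds(limit, 50 * MB))
```

Even the 10% limit drains in about a minute at 50MB/s, well short of
the 480-second hung-task threshold, which supports the point that raw
throughput alone does not explain the stall.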

-- 
Jesper


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 17:29                   ` Linus Torvalds
@ 2009-03-25 17:57                     ` Alan Cox
  2009-03-25 18:09                     ` David Rees
  2009-03-25 18:58                     ` Theodore Tso
  2 siblings, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-25 17:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

> The fsync() problem is really annoying, but what is doubly annoying is 
> that sometimes one process doing fsync() (or sync) seems to cause other 
> processes to hickup too. 

Bug #5942 (interaction with anticipatory io scheduler)
Bug #9546 (with reproducer & logs)
Bug #9911 including a rather natty tester (albeit in java)
Bug #7372 (some info and figures on certain revs it seemed to get worse)
Bug #12309 (more info, including kjournald hack fix using ioprio)

General consensus seems to be 2.6.18 is where the manure intersected with
the air impeller

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25  0:05                 ` Arjan van de Ven
@ 2009-03-25 17:59                   ` David Rees
  2009-03-25 18:40                   ` Stephen Clark
  1 sibling, 0 replies; 664+ messages in thread
From: David Rees @ 2009-03-25 17:59 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jesse Barnes, Theodore Tso, Ingo Molnar, Alan Cox, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, Jesper Krogh,
	Linus Torvalds, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 5:05 PM, Arjan van de Ven <arjan@infradead.org> wrote:
> On Tue, 24 Mar 2009 16:03:53 -0700
> Jesse Barnes <jbarnes@virtuousgeek.org> wrote:
>> I remember early in the 2.6.x days there was a lot of focus on making
>> interactive performance good, and for a long time it was.  But this
>> I/O problem has been around for a *long* time now... What happened?
>> Do not many people run into this daily?  Do all the filesystem
>> hackers run with special mount options to mitigate the problem?
>
> the people that care use my kernel patch on ext3 ;-)
> (or the userland equivalent tweak in /etc/rc.local)

Since I posted your tweak to bug 12309 [1] yesterday, a couple of
comments there have confirmed that increasing the priority of kjournald
reduces latency significantly.  I hope to do some testing today on my
systems to see if it helps on them, too.

-Dave

[1] http://bugzilla.kernel.org/show_bug.cgi?id=12309

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 17:29                   ` Linus Torvalds
  2009-03-25 17:57                     ` Alan Cox
@ 2009-03-25 18:09                     ` David Rees
  2009-03-25 18:21                       ` Linus Torvalds
  2009-03-25 18:58                     ` Theodore Tso
  2 siblings, 1 reply; 664+ messages in thread
From: David Rees @ 2009-03-25 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 10:29 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, 25 Mar 2009, Theodore Tso wrote:
>> I still think the fsync() problem is the much bigger deal, and solving
>> the contention problem isn't going to solve the fsync() latency problem
>> with ext3 data=ordered mode.
>
> The fsync() problem is really annoying, but what is doubly annoying is
> that sometimes one process doing fsync() (or sync) seems to cause other
> processes to hickup too.
>
> Now, I personally solved that problem by moving to (good) SSD's on my
> desktop, and I think that's indeed the long-term solution. But it would be
> good to try to figure out a solution in the short term for people who
> don't have new hardware thrown at them from random companies too.

Throwing SSDs at it only increases the limit before which it becomes
an issue.  They hide the underlying issue and are only a workaround.
Create enough dirty data and you'll get the same latencies, it's just
that that limit is now a lot higher.  Your Intel SSD will write
streaming data 2-4 times faster than your typical disk - and can be an
order of magnitude faster when it comes to small, random writes.

> I suspect it's a combination of filesystem transaction locking, together
> with the VM wanting to write out some unrelated blocks or inodes due to
> the system just being close to the dirty limits. Which is why the
> system-wide hickups then happen especially when writing big files.
>
> The VM _tries_ to do writes in the background, but if the writepage() path
> hits a filesystem-level blocking lock, that background write suddenly
> becomes largely synchronous.
>
> I suspect there is also some possibility of confusion with inter-file
> (false) metadata dependencies. If a filesystem were to think that the file
> size is metadata that should be journaled (in a single journal), and the
> journaling code then decides that it needs to do those meta-data updates
> in the correct order (ie the big file write _before_ the file write that
> wants to be fsync'ed), then the fsync() will be delayed by a totally
> irrelevant large file having to have its data written out (due to
> data=ordered or whatever).

It certainly "feels" like that is the case from the workloads I have
that generate high latencies.

-Dave

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 17:42         ` Jesper Krogh
@ 2009-03-25 18:16           ` David Rees
  2009-03-25 18:46             ` Jesper Krogh
  0 siblings, 1 reply; 664+ messages in thread
From: David Rees @ 2009-03-25 18:16 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 10:42 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> David Rees wrote:
>> On Tue, Mar 24, 2009 at 12:32 AM, Jesper Krogh <jesper@krogh.cc> wrote:
>>> David Rees wrote:
>>> The 480 secondes is not the "wait time" but the time gone before the
>>> message is printed. It the kernel-default it was earlier 120 seconds but
>>> thats changed by Ingo Molnar back in september. I do get a lot of less
>>> noise but it really doesn't tell anything about the nature of the
>>> problem.
>>>
>>> The systes spec:
>>> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
>>> Raid10 connected using 4Gbit fibre-channel. I'll let it up to you to
>>> decide
>>> if thats fast or slow?
>>
>> The drives should be fast enough to saturate 4Gbit FC in streaming
>> writes.  How fast is the array in practice?
>
> Thats allways a good question.. This is by far not being the only user
> of the array at the time of testing.. (there are 4 FC-channel connected to a
> switch). Creating a fresh slice.. and just dd'ing onto it from /dev/zero
> gives:
> jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 78.0557 s, 134 MB/s
> jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 8.11019 s, 129 MB/s
>
> Watching using dstat while dd'ing it peaks at 220M/s

Hmm, not as fast as I expected.

> It has 2GB of battery backed cache. I'm fairly sure that when it was new
> (and I only had connected one host) I could get it up at around 350MB/s.

With 2GB of BBC, I'm surprised you are seeing as much latency as you
are.  It should be able to suck down writes as fast as you can throw
at it.  Is the array configured in writeback mode?

>> On a 32GB system that's 1.6GB of dirty data, but your array should be
>> able to write that out fairly quickly (in a couple seconds) as long as
>> it's not too random.  If it's spread all over the disk, write
>> throughput will drop significantly - how fast is data being written to
>> disk when your system suffers from large write latency?
>
> Thats another thing. I havent been debugging while hitting it (yet) but if I
> go ind and do a sync on the system manually. Then it doesn't get above
> 50MB/s in writeout (measured using dstat). But even that doesn't sum up to 8
> minutes .. 1.6GB at 50MB/s ..=> 32 s.

Have you also tried increasing the IO priority of the kjournald
processes as a workaround as Arjan van de Ven suggests?

You must have a significant amount of activity going to that FC array
from other clients - it certainly doesn't seem to be performing as
well as it could/should be.

-Dave

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:09                     ` David Rees
@ 2009-03-25 18:21                       ` Linus Torvalds
  2009-03-25 18:26                         ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 18:21 UTC (permalink / raw)
  To: David Rees
  Cc: Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, David Rees wrote:
>
> Your Intel SSD will write streaming data 2-4 times faster than your 
> typical disk

Don't even bother with streaming data. The problem is _never_ streaming 
data.

Even a suck-ass laptop drive can write streaming data fast enough that 
people don't care. The problem is invariably that writes from different 
sources (much of it being metadata) interact and cause seeking.

> and can be an order of magnitude faster when it comes to small, random 
> writes.

Umm. More like two orders of magnitude or more.

Random writes on a disk (even a fast one) tend to be in the hundreds of 
kilobytes per second. Have you worked with an Intel SSD? It does tens of 
MB/s on pure random writes.
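
A back-of-the-envelope sketch of where "hundreds of kilobytes per
second" comes from: random-write throughput is roughly operations per
second times the size of each write.  The figures below (about 100
random operations per second for a rotating disk, 4 KiB writes) are
assumptions for illustration, not measurements from this thread.

```python
def random_write_throughput(iops, io_size_bytes):
    """Bytes per second sustained by a device doing `iops` random writes."""
    return iops * io_size_bytes


def iops_needed(target_bytes_per_s, io_size_bytes):
    """Random operations per second required to reach a target throughput."""
    return target_bytes_per_s / io_size_bytes


DISK_IOPS = 100   # assumed: seek-bound rotating disk
IO_SIZE = 4096    # assumed: 4 KiB per write

disk_bps = random_write_throughput(DISK_IOPS, IO_SIZE)
print(f"disk: {disk_bps / 1000:.1f} KB/s")
print(f"ops/s needed for 20 MB/s: {iops_needed(20_000_000, IO_SIZE):.0f}")
```

Under these assumptions the disk manages about 400 KB/s, while "tens
of MB/s" requires thousands of random operations per second, which a
seek-bound device cannot deliver.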

The problem really is gone with an SSD.

And please realize that the problem for me was never 30-second stalls. For 
me, a 3-second stall is unacceptable. It's just very annoying.

		Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:21                       ` Linus Torvalds
@ 2009-03-25 18:26                         ` Linus Torvalds
  2009-03-25 18:48                           ` Ric Wheeler
  2009-03-25 18:49                           ` Alan Cox
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 18:26 UTC (permalink / raw)
  To: David Rees
  Cc: Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Linus Torvalds wrote:
> 
> Even a suck-ass laptop drive can write streaming data fast enough that 
> people don't care. The problem is invariably that writes from different 
> sources (much of it being metadata) interact and cause seeking.

Actually, not just writes.

The IO priority thing is almost certainly that _reads_ (which get higher 
priority by default due to being synchronous) get interspersed with the 
writes, and then even if you _could_ be having streaming writes, what you 
actually end up with is lots of seeking.

Again, good SSD's don't care. Disks do. It doesn't matter if you have a FC 
disk array that can eat 300MB/s when streaming - once you start seeking, 
that 300MB/s goes down like a rock. Battery-protected write caches will 
help - but not a whole lot when streaming more data than they have RAM. 
Basic queuing theory.
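
The queuing argument can be sketched as a simple fill-rate
calculation: a write cache only absorbs a burst until the difference
between the incoming rate and the back-end drain rate has filled it.
The numbers combine figures mentioned in this thread (2GB cache,
300MB/s streaming, 50MB/s observed writeout) purely as an
illustration; the function name is hypothetical.

```python
def seconds_until_cache_full(cache_bytes, inflow_bytes_per_s,
                             drain_bytes_per_s):
    """Time until a write cache fills, given steady inflow and drain rates.

    Returns infinity when the back end drains at least as fast as data
    arrives, because the cache then never fills.
    """
    if inflow_bytes_per_s <= drain_bytes_per_s:
        return float("inf")
    return cache_bytes / (inflow_bytes_per_s - drain_bytes_per_s)


GB = 10 ** 9
MB = 10 ** 6
print(seconds_until_cache_full(2 * GB, 300 * MB, 50 * MB))
```

After that point the host sees the back-end (seeking) rate rather than
the cache rate.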

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 19:00       ` David Rees
  2009-03-25 17:42         ` Jesper Krogh
@ 2009-03-25 18:30         ` Theodore Tso
  2009-03-25 18:40           ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-25 18:30 UTC (permalink / raw)
  To: David Rees; +Cc: Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

On Tue, Mar 24, 2009 at 12:00:41PM -0700, David Rees wrote:
> >>> Consensus seems to be something with large memory machines, lots of dirty
> >>> pages and a long writeout time due to ext3.
> >>
> >> All filesystems seem to suffer from this issue to some degree.  I
> >> posted to the list earlier trying to see if there was anything that
> >> could be done to help my specific case.  I've got a system where if
> >> someone starts writing out a large file, it kills client NFS writes.
> >> Makes the system unusable:
> >> http://marc.info/?l=linux-kernel&m=123732127919368&w=2
> >
> > Yes, I've hit 120s+ penalties just by saving a file in vim.
> 
> Yeah, your disks aren't keeping up and/or data isn't being written out
> efficiently.

Agreed; we probably will need to get some blktrace outputs to see what
is going on.

> >> Only workaround I've found is to reduce dirty_background_ratio and
> >> dirty_ratio to tiny levels.  Or throw good SSDs and/or a fast RAID
> >> array at it so that large writes complete faster.  Have you tried the
> >> new vm_dirty_bytes in 2.6.29?
> >
> > No.. What would you suggest to be a reasonable setting for that?
> 
> Look at whatever is there by default and try cutting them in half to start.

I'm beginning to think that using a "ratio" may be the wrong way to
go.  We probably need to add an optional dirty_max_megabytes field
where we start pushing dirty blocks out when the number of dirty
blocks exceeds either the dirty_ratio or the dirty_max_megabytes,
whichever comes first.  The problem is that 5% might make sense for a
small machine with only 1G of memory, but it might not make so much
sense if you have 32G of memory.

But the other problem is whether we are issuing the writes in an
efficient way, and that means we need to see what is going on at the
blktrace level as a starting point, and maybe we'll need some
custom-designed trace outputs to see what is going on at the
inode/logical block level, not just at the physical block level.
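
A minimal sketch of the "whichever comes first" idea proposed above.
dirty_max_megabytes is the proposed name from this message, not an
existing sysctl, and the 256 MB cap is a hypothetical value chosen
only to show the effect on small and large machines.

```python
def writeback_threshold(mem_bytes, dirty_ratio_percent,
                        dirty_max_megabytes=None):
    """Dirty-data threshold in bytes: the ratio limit, optionally capped.

    `dirty_max_megabytes` models the proposed absolute cap; None means
    no cap, so only the ratio applies.
    """
    by_ratio = mem_bytes * dirty_ratio_percent // 100
    if dirty_max_megabytes is None:
        return by_ratio
    return min(by_ratio, dirty_max_megabytes * 1024 * 1024)


GIB = 1024 ** 3
for mem in (1 * GIB, 32 * GIB):
    print(mem // GIB, writeback_threshold(mem, 5, dirty_max_megabytes=256))
```

On the 1G machine the 5% ratio is the smaller limit; on the 32G
machine the absolute cap takes over.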

	      	    	       	       - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:30         ` Theodore Tso
@ 2009-03-25 18:40           ` Linus Torvalds
  2009-03-25 22:05             ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 18:40 UTC (permalink / raw)
  To: Theodore Tso; +Cc: David Rees, Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Theodore Tso wrote:
>
> I'm beginning to think that using a "ratio" may be the wrong way to
> go.  We probably need to add an optional dirty_max_megabytes field
> where we start pushing dirty blocks out when the number of dirty
> blocks exceeds either the dirty_ratio or the dirty_max_megabytes,
> which ever comes first.

We have that. Except it's called "dirty_bytes" and 
"dirty_background_bytes", and it defaults to zero (off).

The problem being that unlike the ratio, there's no sane default value 
that you can at least argue is not _entirely_ pointless.
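
For reference, a sketch of how the existing knobs combine.  It reads
dirty_bytes and dirty_ratio from /proc/sys/vm and assumes the
documented behaviour that a nonzero dirty_bytes takes the place of
dirty_ratio.  Applying the ratio to a caller-supplied memory size is a
simplification (the kernel uses free plus reclaimable memory), and the
helper names are hypothetical.

```python
from pathlib import Path


def read_vm_sysctl(name, base="/proc/sys/vm"):
    """Read an integer VM sysctl such as dirty_bytes or dirty_ratio."""
    return int(Path(base, name).read_text().strip())


def effective_dirty_limit(mem_bytes, base="/proc/sys/vm"):
    """Approximate dirty limit in bytes from the current sysctl settings.

    A nonzero dirty_bytes is used as-is; otherwise dirty_ratio percent
    of `mem_bytes` is returned.
    """
    dirty_bytes = read_vm_sysctl("dirty_bytes", base)
    if dirty_bytes:
        return dirty_bytes
    return mem_bytes * read_vm_sysctl("dirty_ratio", base) // 100
```

Calling effective_dirty_limit(32 * 10**9) on a 2.6.29 or later system
with the defaults quoted earlier (dirty_ratio 10, dirty_bytes 0) would
give roughly 3.2GB under this simplification.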

		Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25  0:05                 ` Arjan van de Ven
  2009-03-25 17:59                   ` David Rees
@ 2009-03-25 18:40                   ` Stephen Clark
  2009-03-26 23:53                     ` Mark Lord
  1 sibling, 1 reply; 664+ messages in thread
From: Stephen Clark @ 2009-03-25 18:40 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jesse Barnes, Theodore Tso, Ingo Molnar, Alan Cox, Andrew Morton,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

Arjan van de Ven wrote:
> On Tue, 24 Mar 2009 16:03:53 -0700
> Jesse Barnes <jbarnes@virtuousgeek.org> wrote:
> 
>> I remember early in the 2.6.x days there was a lot of focus on making
>> interactive performance good, and for a long time it was.  But this
>> I/O problem has been around for a *long* time now... What happened?
>> Do not many people run into this daily?  Do all the filesystem
>> hackers run with special mount options to mitigate the problem?
>>
> 
> the people that care use my kernel patch on ext3 ;-)
> (or the userland equivalent tweak in /etc/rc.local)
> 
> 
> 
OK, I'll bite: what is the userland tweak?

-- 

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety."  (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases."  (Thomas Jefferson)



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:16           ` David Rees
@ 2009-03-25 18:46             ` Jesper Krogh
  0 siblings, 0 replies; 664+ messages in thread
From: Jesper Krogh @ 2009-03-25 18:46 UTC (permalink / raw)
  To: David Rees; +Cc: Linus Torvalds, Linux Kernel Mailing List

David Rees wrote:
>>> writes.  How fast is the array in practice?
>> Thats allways a good question.. This is by far not being the only user
>> of the array at the time of testing.. (there are 4 FC-channel connected to a
>> switch). Creating a fresh slice.. and just dd'ing onto it from /dev/zero
>> gives:
>> jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=10000
>> 10000+0 records in
>> 10000+0 records out
>> 10485760000 bytes (10 GB) copied, 78.0557 s, 134 MB/s
>> jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB) copied, 8.11019 s, 129 MB/s
>>
>> Watching using dstat while dd'ing it peaks at 220M/s
> 
> Hmm, not as fast as I expected.

Me neither, but I always get disappointed.

>> It has 2GB of battery backed cache. I'm fairly sure that when it was new
>> (and I only had connected one host) I could get it up at around 350MB/s.
> 
> With 2GB of BBC, I'm surprised you are seeing as much latency as you
> are.  It should be able to suck down writes as fast as you can throw
> at it.  Is the array configured in writeback mode?

Yes, but I triple-checked: the memory upgrade hadn't been installed, so
it's actually only 512MB.

> 
>>> On a 32GB system that's 1.6GB of dirty data, but your array should be
>>> able to write that out fairly quickly (in a couple seconds) as long as
>>> it's not too random.  If it's spread all over the disk, write
>>> throughput will drop significantly - how fast is data being written to
>>> disk when your system suffers from large write latency?
>> Thats another thing. I havent been debugging while hitting it (yet) but if I
>> go ind and do a sync on the system manually. Then it doesn't get above
>> 50MB/s in writeout (measured using dstat). But even that doesn't sum up to 8
>> minutes .. 1.6GB at 50MB/s ..=> 32 s.
> 
> Have you also tried increasing the IO priority of the kjournald
> processes as a workaround as Arjan van de Ven suggests?

No. I'll try to slip that one in.

-- 
Jesper

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:26                         ` Linus Torvalds
@ 2009-03-25 18:48                           ` Ric Wheeler
  2009-03-25 18:49                           ` Alan Cox
  1 sibling, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-25 18:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rees, Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Linus Torvalds wrote:
>   
>> Even a suck-ass laptop drive can write streaming data fast enough that 
>> people don't care. The problem is invariably that writes from different 
>> sources (much of it being metadata) interact and cause seeking.
>>     
>
> Actually, not just writes.
>
> The IO priority thing is almost certainly that _reads_ (which get higher 
> priority by default due to being synchronous) get interspersed with the 
> writes, and then even if you _could_ be having streaming writes, what you 
> actually end up with is lots of seeking.
>
> Again, good SSD's don't care. Disks do. It doesn't matter if you have a FC 
> disk array that can eat 300MB/s when streaming - once you start seeking, 
> that 300MB/s goes down like a rock. Battery-protected write caches will 
> help - but not a whole lot when streaming more data than they have RAM. 
> Basic queuing theory.
>
> 			Linus
>   

This is actually not really true - random writes to an enterprise disk 
array will make your Intel SSD look slow. Effectively, they are 
extremely large, battery backed banks of DRAM with lots of fibre channel 
ports.  Some of the bigger ones can have several hundred GB of DRAM and 
dozens of fibre channel ports to feed them.

Of course, if your random writes exceed the cache capacity and you fall 
back to their internal disks (SSD or traditional), your random write 
speed will drop.

Ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:26                         ` Linus Torvalds
  2009-03-25 18:48                           ` Ric Wheeler
@ 2009-03-25 18:49                           ` Alan Cox
  2009-03-25 18:55                             ` Ric Wheeler
  1 sibling, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-25 18:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rees, Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	Jesper Krogh, Linux Kernel Mailing List

> Again, good SSD's don't care. Disks do. It doesn't matter if you have a FC 
> disk array that can eat 300MB/s when streaming - once you start seeking, 
> that 300MB/s goes down like a rock. Battery-protected write caches will 
> help - but not a whole lot when streaming more data than they have RAM. 
> Basic queuing theory.

Subtly more complex than that. If your mashed up I/O streams fit into the
2GB or so of cache (minus one stream to disk) you win. You also win
because you take a lot of fragmented OS I/O and turn it into bigger
chunks of writing better scheduled. The latter win arguably shouldn't
happen but it does occur (I guess in part that says we suck) and it
occurs big time when you've got multiple accessors to a shared storage
system (where the host OS's can't help)

Alan

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:49                           ` Alan Cox
@ 2009-03-25 18:55                             ` Ric Wheeler
  0 siblings, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-25 18:55 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, David Rees, Theodore Tso, Jan Kara,
	Andrew Morton, Ingo Molnar, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, Jens Axboe, Jesper Krogh, Linux Kernel Mailing List

Alan Cox wrote:
>> Again, good SSD's don't care. Disks do. It doesn't matter if you have a FC 
>> disk array that can eat 300MB/s when streaming - once you start seeking, 
>> that 300MB/s goes down like a rock. Battery-protected write caches will 
>> help - but not a whole lot when streaming more data than they have RAM. 
>> Basic queuing theory.
>>     
>
> Subtly more complex than that. If your mashed up I/O streams fit into the
> 2GB or so of cache (minus one stream to disk) you win. You also win
> because you take a lot of fragmented OS I/O and turn it into bigger
> chunks of writing better scheduled. The latter win arguably shouldn't
> happen but it does occur (I guess in part that says we suck) and it
> occurs big time when you've got multiple accessors to a shared storage
> system (where the host OS's can't help)
>
> Alan
>   

The other thing that can impact random writes on arrays is their 
internal "track" size - if the random write is of a partial track, it 
forces a read-modify-write with a back end disk read.  Some arrays have 
large internal tracks, others have smaller ones.

Again, not unlike what you see with some SSD's and their erase block 
size - give them even multiples of that and they are quite happy.
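
A rough way to picture the partial-track penalty is the toy model below. It assumes (this is an illustration, not how any particular array works) that every track only partly covered by a write has to be read in full from the back end before it can be rewritten. The function name and the 64 KiB track size are made up for the example.

```python
def back_end_read_bytes(offset: int, length: int, track_size: int) -> int:
    """Bytes a toy array must read back before writing [offset, offset+length).

    Assumption: a track that is only partially covered by the write is read
    in full (read-modify-write); a fully covered track needs no read at all.
    """
    if length <= 0:
        return 0
    end = offset + length
    first_track = offset // track_size
    last_track = (end - 1) // track_size
    read = 0
    for track in range(first_track, last_track + 1):
        track_start = track * track_size
        track_end = track_start + track_size
        # How much of this track the write actually covers.
        covered = min(end, track_end) - max(offset, track_start)
        if covered < track_size:
            read += track_size
    return read


TRACK = 64 * 1024  # hypothetical internal track size
aligned = back_end_read_bytes(0, TRACK, TRACK)        # full track, no read
partial = back_end_read_bytes(0, 4096, TRACK)         # one partial track
straddle = back_end_read_bytes(4096, TRACK, TRACK)    # two partial tracks
```

In this model a track-sized write that is merely misaligned costs two full track reads, which is why even multiples of the track (or erase block) size matter.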

Ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 17:29                   ` Linus Torvalds
  2009-03-25 17:57                     ` Alan Cox
  2009-03-25 18:09                     ` David Rees
@ 2009-03-25 18:58                     ` Theodore Tso
  2009-03-25 19:48                       ` Christoph Hellwig
  2009-03-25 20:45                       ` Linus Torvalds
  2 siblings, 2 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-25 18:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 10:29:48AM -0700, Linus Torvalds wrote:
> I suspect there is also some possibility of confusion with inter-file 
> (false) metadata dependencies. If a filesystem were to think that the file 
> size is metadata that should be journaled (in a single journal), and the 
> journaling code then decides that it needs to do those meta-data updates 
> in the correct order (ie the big file write _before_ the file write that 
> wants to be fsync'ed), then the fsync() will be delayed by a totally 
> irrelevant large file having to have its data written out (due to 
> data=ordered or whatever).

It's not just the file size; it's the block allocation decisions.
Ext3 doesn't have delayed allocation, so as soon as you issue the
write, we have to allocate the block, which means grabbing blocks and
making changes to the block bitmap, and then updating the inode with
those block allocation decisions.  It's a lot more than just i_size.
And the problem is that if we do this for the big file write, and the
small file write happens to also touch the same inode table block
and/or block allocation bitmap, then when we fsync() the small file we
end up pushing out the metadata updates associated with the big file
write, and thus we need to flush out the data blocks associated with
the big file write as well.
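
To make the entanglement concrete, here is a deliberately crude toy model. It is not ext3 or jbd code; the class and its fields are invented for illustration, and it assumes the worst case where every metadata update lands in one shared running transaction.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOrderedJournal:
    """One running transaction, loosely modelled on data=ordered (toy only)."""
    dirty_data: dict = field(default_factory=dict)    # file name -> dirty bytes
    in_transaction: set = field(default_factory=set)  # files with journaled metadata

    def write(self, name: str, nbytes: int) -> None:
        # No delayed allocation: blocks are allocated at write() time, so the
        # bitmap and inode updates join the running transaction right away.
        self.dirty_data[name] = self.dirty_data.get(name, 0) + nbytes
        self.in_transaction.add(name)

    def fsync(self, name: str) -> int:
        """Return how many data bytes must hit disk before fsync() returns."""
        if name not in self.in_transaction:
            return 0
        # Committing the transaction commits everyone's metadata, and ordered
        # mode wants the matching data blocks written out first.
        flushed = sum(self.dirty_data.pop(f, 0) for f in self.in_transaction)
        self.in_transaction.clear()
        return flushed


journal = ToyOrderedJournal()
journal.write("big.iso", 1 << 30)   # large streaming write
journal.write("small.conf", 50)     # tiny file the application cares about
cost = journal.fsync("small.conf")  # pays for big.iso as well
```

In the model, the fsync() of the 50-byte file has to wait for the full gigabyte, which is the delay being described.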

Now, there are three ways of solving this problem.  One is to use
delayed allocation, where we don't make the block allocation decisions
until the very last minute.  This is what ext4 and XFS do.  The
problem with this is that unrelated filesystem operations can end up
causing zero length files, either before the file write (i.e.,
replace-via-truncate, where the application does open/truncate/write/
close) or after the file write (i.e., replace-via-rename, where
the application does open/write/close/rename), when the application
omits the fsync().  So with ext4 we have workarounds that start pushing
out the data blocks in the replace-via-rename and
replace-via-truncate cases, while XFS will do an implied fsync for
replace-via-truncate only, and btrfs will do an implied fsync for
replace-via-rename only.
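
For reference, the replace-via-rename pattern with the fsync() left in looks roughly like the sketch below. It is a userspace illustration for POSIX systems, not code from any of the applications under discussion; the function name and the ".tmp-" prefix are arbitrary choices.

```python
import os
import tempfile


def replace_via_rename(path: str, data: bytes) -> None:
    """Replace path with data so a crash leaves the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # drain Python's userspace buffer
            os.fsync(tmp.fileno())  # the step that tends to get omitted
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory too, so the rename itself is on stable storage.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the os.fsync() on the temporary file, delayed allocation can leave the rename committed while the data blocks are still unwritten, which is the zero length file case.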

The second solution is we could add a huge amount of machinery to try
to track these logical dependencies, and then be able to "back out" the
changes to the inode table or block allocation bitmap for the big file
write when we want to fsync out the small file.  This is roughly what
the BSD Soft Updates mechanism does, and it works, but at the cost of
a *huge* amount of complexity.  The amount of accounting data you have
to track so that you can partially back out various filesystem
operations, and then the state tables that make use of this accounting
data is not trivial.  One of the downsides of this mechanism is that
it makes it extremely difficult to add new features/functionality such
as extended attributes or ACL's, since very few people understand the
complexities needed to support it.  As a result Linux had acl and
xattr support long before Kirk McKusick got around to adding those
features in UFS2.

The third potential solution we can try doing is to make some tuning
adjustments to the VM so that we start pushing out these data blocks
much more aggressively out to the disk.  If we assume that many
applications aren't going to be using fsync, and we need to worry
about all sorts of implied dependencies where a small file gets pushed
out to disk, but a large file does not, you can have endless amounts
of fun in terms of "application level file corruption", which is
simply caused by the fact that a small file has been pushed out to
disk, and a large file hasn't been pushed out to disk yet.  If it's
going to be considered fair game that application programmers aren't
going to be required to use fsync() when they need to depend on
something being on stable storage after a crash, then we need to tune
the VM to much more aggressively clean dirty pages.  Even if we remove
the false dependencies at the filesystem level (i.e., fsck-detectable
consistency problems), there is no way for the filesystem to be able
to guess about implied dependencies between different files at the
application level.

Traditionally, the way applications told us about such dependencies
was fsync().  But if application programmers are demanding that
fsync() is no longer required for correct operation after a filesystem
crash, all we can do is push things out to disk much more
aggressively.

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25  9:39                       ` Jens Axboe
@ 2009-03-25 19:32                         ` Jeff Garzik
  2009-03-25 19:43                           ` Christoph Hellwig
  2009-03-25 19:43                           ` Jens Axboe
  0 siblings, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-25 19:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Jens Axboe wrote:
> On Tue, Mar 24 2009, Jeff Garzik wrote:
>> Linus Torvalds wrote:
>>> But I really don't understand filesystem people who think that "fsck" 
>>> is the important part, regardless of whether the data is valid or not. 
>>> That's just stupid and _obviously_ bogus.
>> I think I can understand that point of view, at least:
>>
>> More customers complain about hours-long fsck times than they do about  
>> silent data corruption of non-fsync'd files.
>>
>>
>>> The point is, if you write your metadata earlier (say, every 5 sec) and 
>>> the real data later (say, every 30 sec), you're actually MORE LIKELY to 
>>> see corrupt files than if you try to write them together.
>>>
>>> And if you write your data _first_, you're never going to see 
>>> corruption at all.
>> Amen.
>>
>> And, personal filesystem pet peeve:  please encourage proper FLUSH CACHE  
>> use to give users the data guarantees they deserve.  Linux's sync(2) and  
>> fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee  
>> a media write.
> 
> fsync already does that, at least if you have barriers enabled on your
> drive.

Erm, no, you don't enable barriers on your drive, they are not a 
hardware feature.  You enable barriers via your filesystem.

Stating "fsync already does that" borders on false, because that assumes
(a) the user has a fs that supports barriers
(b) the user is actually aware of a 'barriers' mount option and what it 
means
(c) the user has turned on an option normally defaulted to off.

Or in other words, it pretty much never happens.

Furthermore, a blatantly obvious place to flush data to media -- 
fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the block 
layer to issue a FLUSH CACHE for __any__ filesystem.  But that doesn't 
happen either.

So, no, for 95% of Linux users, fsync does _not_ already do that.  If 
you are lucky enough to use XFS or ext4, you're covered.  That's it.

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:32                         ` Jeff Garzik
@ 2009-03-25 19:43                           ` Christoph Hellwig
  2009-03-25 19:43                           ` Jens Axboe
  1 sibling, 0 replies; 664+ messages in thread
From: Christoph Hellwig @ 2009-03-25 19:43 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jens Axboe, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 03:32:13PM -0400, Jeff Garzik wrote:
> So, no, for 95% of Linux users, fsync does _not_ already do that.  If  
> you are lucky enough to use XFS or ext4, you're covered.  That's it.

reiserfs also does the correct thing.  As does ext3 on suse kernels.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:32                         ` Jeff Garzik
  2009-03-25 19:43                           ` Christoph Hellwig
@ 2009-03-25 19:43                           ` Jens Axboe
  2009-03-25 19:49                             ` Ric Wheeler
                                               ` (2 more replies)
  1 sibling, 3 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-25 19:43 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25 2009, Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>> Linus Torvalds wrote:
>>>> But I really don't understand filesystem people who think that 
>>>> "fsck" is the important part, regardless of whether the data is 
>>>> valid or not. That's just stupid and _obviously_ bogus.
>>> I think I can understand that point of view, at least:
>>>
>>> More customers complain about hours-long fsck times than they do 
>>> about  silent data corruption of non-fsync'd files.
>>>
>>>
>>>> The point is, if you write your metadata earlier (say, every 5 sec) 
>>>> and the real data later (say, every 30 sec), you're actually MORE 
>>>> LIKELY to see corrupt files than if you try to write them together.
>>>>
>>>> And if you write your data _first_, you're never going to see  
>>>> corruption at all.
>>> Amen.
>>>
>>> And, personal filesystem pet peeve:  please encourage proper FLUSH 
>>> CACHE  use to give users the data guarantees they deserve.  Linux's 
>>> sync(2) and  fsync(2) (and fdatasync, etc.) should poke the block 
>>> layer to guarantee  a media write.
>>
>> fsync already does that, at least if you have barriers enabled on your
>> drive.
>
> Erm, no, you don't enable barriers on your drive, they are not a  
> hardware feature.  You enable barriers via your filesystem.

Thanks for the lesson Jeff, I'm obviously not aware how that stuff
works...

> Stating "fsync already does that" borders on false, because that assumes
> (a) the user has a fs that supports barriers
> (b) the user is actually aware of a 'barriers' mount option and what it  
> means
> (c) the user has turned on an option normally defaulted to off.
>
> Or in other words, it pretty much never happens.

That is true, except if you use xfs/ext4. And this discussion is fine,
as was the one a few months back that got ext4 to enable barriers by
default. If I had submitted patches to do that back in 2001/2 when the
barrier stuff was written, I would have been shot for introducing such a
slow down. After people found out that it just wasn't something silly,
then you have a way to enable it.

I'd still wager that most people would rather have a 'good enough
fsync' on their desktops than incur the penalty of barriers or write
through caching. I know I do.

> Furthermore, a blatantly obvious place to flush data to media --  
> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the block  
> layer to issue a FLUSH CACHE for __any__ filesystem.  But that doesn't  
> happen either.
>
> So, no, for 95% of Linux users, fsync does _not_ already do that.  If  
> you are lucky enough to use XFS or ext4, you're covered.  That's it.

The point is that you need to expose this choice somewhere, and that
'somewhere' isn't manually editing fstab and enabling barriers or
fsync-for-real. And it should be easier.

Another problem is that FLUSH_CACHE sucks. Really. And not just on
ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
wait for the world to finish. Pretty hard to teach people to use a nicer
fdatasync(), when the majority of the cost now becomes flushing the
cache of that 1TB drive you happen to have 8 partitions on. Good luck
with that.
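
Anyone who wants to look at this on their own hardware can time the sync step of a tiny write with something like the sketch below. It is only a measurement aid with an invented helper name; it cannot tell whether a cache flush was actually issued to the drive, and the numbers depend entirely on the filesystem, mount options and device.

```python
import os
import time


def timed_small_write(path: str, payload: bytes, data_only: bool) -> float:
    """Write payload to path, sync it, and return the seconds spent syncing."""
    # os.fdatasync() is missing on some platforms; fall back to os.fsync().
    sync = getattr(os, "fdatasync", os.fsync) if data_only else os.fsync
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        start = time.perf_counter()
        sync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)
```

Calling it with a 50 byte payload, once with data_only=False and once with data_only=True, gives a feel for how little fdatasync() saves when most of the time goes to flushing the drive cache.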

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:58                     ` Theodore Tso
@ 2009-03-25 19:48                       ` Christoph Hellwig
  2009-03-25 21:50                         ` Theodore Tso
  2009-03-25 20:45                       ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Christoph Hellwig @ 2009-03-25 19:48 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Jan Kara, Andrew Morton,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 02:58:24PM -0400, Theodore Tso wrote:
> omits the fsync().  So with ext4 we have workarounds that start pushing
> out the data blocks in the replace-via-rename and
> replace-via-truncate cases, while XFS will do an implied fsync for
> replace-via-truncate only, and btrfs will do an implied fsync for
> replace-via-rename only.

The XFS one and the ext4 one that I saw only start an _asynchronous_
writeout.  Which is not an implied fsync but snake oil to make the
most common complaints go away without providing hard guarantees.

IFF we want to go down this route we should better provide strong
guaranteed semantics and document the property.  And of course implement
it consistently on all native filesystems.

> Traditionally, the way applications told us about such dependencies
> was fsync().  But if application programmers are demanding that
> fsync() is no longer required for correct operation after a filesystem
> crash, all we can do is push things out to disk much more
> aggressively.

Note that the rename for atomic commits trick originated in mail servers
which always did the proper fsync.  When the word spread into the
desktop world it looks like this wisdom got lost.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:43                           ` Jens Axboe
@ 2009-03-25 19:49                             ` Ric Wheeler
  2009-03-25 19:57                               ` Jens Axboe
  2009-03-25 20:16                               ` Jeff Garzik
  2009-03-25 20:25                             ` Jeff Garzik
  2009-03-31 20:49                             ` Jeff Garzik
  2 siblings, 2 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-25 19:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Jens Axboe wrote:
> On Wed, Mar 25 2009, Jeff Garzik wrote:
>   
>> Jens Axboe wrote:
>>     
>>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>>       
>>>> Linus Torvalds wrote:
>>>>         
>>>>> But I really don't understand filesystem people who think that 
>>>>> "fsck" is the important part, regardless of whether the data is 
>>>>> valid or not. That's just stupid and _obviously_ bogus.
>>>>>           
>>>> I think I can understand that point of view, at least:
>>>>
>>>> More customers complain about hours-long fsck times than they do 
>>>> about  silent data corruption of non-fsync'd files.
>>>>
>>>>
>>>>         
>>>>> The point is, if you write your metadata earlier (say, every 5 sec) 
>>>>> and the real data later (say, every 30 sec), you're actually MORE 
>>>>> LIKELY to see corrupt files than if you try to write them together.
>>>>>
>>>>> And if you write your data _first_, you're never going to see  
>>>>> corruption at all.
>>>>>           
>>>> Amen.
>>>>
>>>> And, personal filesystem pet peeve:  please encourage proper FLUSH 
>>>> CACHE  use to give users the data guarantees they deserve.  Linux's 
>>>> sync(2) and  fsync(2) (and fdatasync, etc.) should poke the block 
>>>> layer to guarantee  a media write.
>>>>         
>>> fsync already does that, at least if you have barriers enabled on your
>>> drive.
>>>       
>> Erm, no, you don't enable barriers on your drive, they are not a  
>> hardware feature.  You enable barriers via your filesystem.
>>     
>
> Thanks for the lesson Jeff, I'm obviously not aware how that stuff
> works...
>
>   
>> Stating "fsync already does that" borders on false, because that assumes
>> (a) the user has a fs that supports barriers
>> (b) the user is actually aware of a 'barriers' mount option and what it  
>> means
>> (c) the user has turned on an option normally defaulted to off.
>>
>> Or in other words, it pretty much never happens.
>>     
>
> That is true, except if you use xfs/ext4. And this discussion is fine,
> as was the one a few months back that got ext4 to enable barriers by
> default. If I had submitted patches to do that back in 2001/2 when the
> barrier stuff was written, I would have been shot for introducing such a
> slow down. After people found out that it just wasn't something silly,
> then you have a way to enable it.
>
> I'd still wager that most people would rather have a 'good enough
> fsync' on their desktops than incur the penalty of barriers or write
> through caching. I know I do.
>
>   
>> Furthermore, a blatantly obvious place to flush data to media --  
>> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the block  
>> layer to issue a FLUSH CACHE for __any__ filesystem.  But that doesn't  
>> happen either.
>>
>> So, no, for 95% of Linux users, fsync does _not_ already do that.  If  
>> you are lucky enough to use XFS or ext4, you're covered.  That's it.
>>     
>
> The point is that you need to expose this choice somewhere, and that
> 'somewhere' isn't manually editing fstab and enabling barriers or
> fsync-for-real. And it should be easier.
>
> Another problem is that FLUSH_CACHE sucks. Really. And not just on
> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
> wait for the world to finish. Pretty hard to teach people to use a nicer
> fdatasync(), when the majority of the cost now becomes flushing the
> cache of that 1TB drive you happen to have 8 partitions on. Good luck
> with that.
>
>   
And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE 
is per device (not file system).

When you issue an fsync() on a disk with multiple partitions, you will 
flush the data for all of its partitions from the write cache....

ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:49                             ` Ric Wheeler
@ 2009-03-25 19:57                               ` Jens Axboe
  2009-03-25 20:41                                 ` Hugh Dickins
  2009-03-25 20:16                               ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-25 19:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25 2009, Ric Wheeler wrote:
> Jens Axboe wrote:
>> On Wed, Mar 25 2009, Jeff Garzik wrote:
>>   
>>> Jens Axboe wrote:
>>>     
>>>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>>>       
>>>>> Linus Torvalds wrote:
>>>>>         
>>>>>> But I really don't understand filesystem people who think that  
>>>>>> "fsck" is the important part, regardless of whether the data is 
>>>>>> valid or not. That's just stupid and _obviously_ bogus.
>>>>>>           
>>>>> I think I can understand that point of view, at least:
>>>>>
>>>>> More customers complain about hours-long fsck times than they do  
>>>>> about  silent data corruption of non-fsync'd files.
>>>>>
>>>>>
>>>>>         
>>>>>> The point is, if you write your metadata earlier (say, every 5 
>>>>>> sec) and the real data later (say, every 30 sec), you're 
>>>>>> actually MORE LIKELY to see corrupt files than if you try to 
>>>>>> write them together.
>>>>>>
>>>>>> And if you write your data _first_, you're never going to see   
>>>>>> corruption at all.
>>>>>>           
>>>>> Amen.
>>>>>
>>>>> And, personal filesystem pet peeve:  please encourage proper 
>>>>> FLUSH CACHE  use to give users the data guarantees they deserve.  
>>>>> Linux's sync(2) and  fsync(2) (and fdatasync, etc.) should poke 
>>>>> the block layer to guarantee  a media write.
>>>>>         
>>>> fsync already does that, at least if you have barriers enabled on your
>>>> drive.
>>>>       
>>> Erm, no, you don't enable barriers on your drive, they are not a   
>>> hardware feature.  You enable barriers via your filesystem.
>>>     
>>
>> Thanks for the lesson Jeff, I'm obviously not aware how that stuff
>> works...
>>
>>   
>>> Stating "fsync already does that" borders on false, because that assumes
>>> (a) the user has a fs that supports barriers
>>> (b) the user is actually aware of a 'barriers' mount option and what 
>>> it  means
>>> (c) the user has turned on an option normally defaulted to off.
>>>
>>> Or in other words, it pretty much never happens.
>>>     
>>
>> That is true, except if you use xfs/ext4. And this discussion is fine,
>> as was the one a few months back that got ext4 to enable barriers by
>> default. If I had submitted patches to do that back in 2001/2 when the
>> barrier stuff was written, I would have been shot for introducing such a
>> slow down. After people found out that it just wasn't something silly,
>> then you have a way to enable it.
>>
>> I'd still wager that most people would rather have a 'good enough
>> fsync' on their desktops than incur the penalty of barriers or write
>> through caching. I know I do.
>>
>>   
>>> Furthermore, a blatantly obvious place to flush data to media --   
>>> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the 
>>> block  layer to issue a FLUSH CACHE for __any__ filesystem.  But that 
>>> doesn't  happen either.
>>>
>>> So, no, for 95% of Linux users, fsync does _not_ already do that.  If 
>>>  you are lucky enough to use XFS or ext4, you're covered.  That's it.
>>>     
>>
>> The point is that you need to expose this choice somewhere, and that
>> 'somewhere' isn't manually editing fstab and enabling barriers or
>> fsync-for-real. And it should be easier.
>>
>> Another problem is that FLUSH_CACHE sucks. Really. And not just on
>> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
>> wait for the world to finish. Pretty hard to teach people to use a nicer
>> fdatasync(), when the majority of the cost now becomes flushing the
>> cache of that 1TB drive you happen to have 8 partitions on. Good luck
>> with that.
>>
>>   
> And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE  
> is per device (not file system).
>
> When you issue an fsync() on a disk with multiple partitions, you will  
> flush the data for all of its partitions from the write cache....

Exactly, that's what my (vague) 8 partition reference was for :-)
A range flush would be so much more palatable.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:49                             ` Ric Wheeler
  2009-03-25 19:57                               ` Jens Axboe
@ 2009-03-25 20:16                               ` Jeff Garzik
  2009-03-25 20:25                                 ` Ric Wheeler
  2009-03-25 21:27                                 ` Linux 2.6.29 Benny Halevy
  1 sibling, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-25 20:16 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jens Axboe, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Ric Wheeler wrote:
> And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE
> is per device (not file system).
> 
> When you issue an fsync() on a disk with multiple partitions, you will 
> flush the data for all of its partitions from the write cache....

SCSI's SYNCHRONIZE CACHE command already accepts an (LBA, length) pair.
We could make use of that.

And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
demonstrate clear benefit.

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:16                               ` Jeff Garzik
@ 2009-03-25 20:25                                 ` Ric Wheeler
  2009-03-25 21:22                                   ` James Bottomley
  2009-03-25 21:27                                 ` Linux 2.6.29 Benny Halevy
  1 sibling, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-25 20:25 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley
  Cc: Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Jeff Garzik wrote:
> Ric Wheeler wrote:
>> And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE
>> is per device (not file system).
>>
>> When you issue an fsync() on a disk with multiple partitions, you 
>> will flush the data for all of its partitions from the write cache....
>
> SCSI's SYNCHRONIZE CACHE command already accepts an (LBA, length) 
> pair.  We could make use of that.
>
> And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
> demonstrate clear benefit.
>
>     Jeff

How well supported is this in SCSI?  Can we try it out with a commodity 
SAS drive?

Ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:43                           ` Jens Axboe
  2009-03-25 19:49                             ` Ric Wheeler
@ 2009-03-25 20:25                             ` Jeff Garzik
  2009-03-25 20:40                               ` Linus Torvalds
  2009-03-27  7:46                               ` Jens Axboe
  2009-03-31 20:49                             ` Jeff Garzik
  2 siblings, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-25 20:25 UTC (permalink / raw)
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Jens Axboe wrote:
> On Wed, Mar 25 2009, Jeff Garzik wrote:
>> Stating "fsync already does that" borders on false, because that assumes
>> (a) the user has a fs that supports barriers
>> (b) the user is actually aware of a 'barriers' mount option and what it  
>> means
>> (c) the user has turned on an option normally defaulted to off.
>>
>> Or in other words, it pretty much never happens.
> 
> That is true, except if you use xfs/ext4. And this discussion is fine,
> as was the one a few months back that got ext4 to enable barriers by
> default. If I had submitted patches to do that back in 2001/2 when the
> barrier stuff was written, I would have been shot for introducing such a
> slow down. After people found out that it just wasn't something silly,
> then you have a way to enable it.
> 
> I'd still wager that most people would rather have a 'good enough
> fsync' on their desktops than incur the penalty of barriers or write
> through caching. I know I do.

That's a strawman argument:  The choice is not between "good enough 
fsync" and full use of barriers / write-through caching, at all.

It is clearly possible to implement an fsync(2) that causes FLUSH CACHE 
to be issued, without adding full barrier support to a filesystem.  It 
is likely doable to avoid touching per-filesystem code at all, if we 
issue the flush from a generic fsync(2) code path in the kernel.

Thus, you have a "third way":  fsync(2) gives the guarantee it is 
supposed to, but you do not take the full performance hit of 
barriers-all-the-time.

Remember, fsync(2) means that the user _expects_ a performance hit.

And they took the extra step to call fsync(2) because they want a 
guarantee, not a lie.

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:25                             ` Jeff Garzik
@ 2009-03-25 20:40                               ` Linus Torvalds
  2009-03-25 20:57                                 ` Ric Wheeler
                                                   ` (3 more replies)
  2009-03-27  7:46                               ` Jens Axboe
  1 sibling, 4 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 20:40 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Jeff Garzik wrote:
> 
> It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
> issued, without adding full barrier support to a filesystem.  It is likely
> doable to avoid touching per-filesystem code at all, if we issue the flush
> from a generic fsync(2) code path in the kernel.

We could easily do that. It would even work for most cases. The 
problematic ones are where filesystems do their own disk management, but I 
guess those people can do their own fsync() management too.

Somebody send me the patch, we can try it out.

> Remember, fsync(2) means that the user _expects_ a performance hit.

Within reason, though.

OS X, for example, doesn't do the disk barrier. It requires you to do a 
separate FULL_FSYNC (or something similar) ioctl to get that. Apparently 
exactly because users don't expect quite _that_ big of a performance hit.

(Or maybe just because it was easier to do that way. Never attribute to 
malice what can be sufficiently explained by stupidity).
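
For the record, the OS X request is the F_FULLFSYNC fcntl. From userspace, a portable "use the strongest sync available" helper might look like the sketch below; the helper name and its return strings are invented here, and on Linux it simply falls back to plain fsync().

```python
import fcntl
import os


def full_sync(fd: int) -> str:
    """Sync fd as strongly as the platform allows; report which call was used."""
    # F_FULLFSYNC only exists where the OS defines it (macOS); elsewhere the
    # attribute is absent from the fcntl module.
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            fcntl.fcntl(fd, full)
            return "F_FULLFSYNC"
        except OSError:
            pass  # some filesystems reject the request; fall through
    os.fsync(fd)
    return "fsync"
```

Whether the plain fsync() fallback reaches the media is exactly the open question in this thread, so the return value only says which call was made, not what the drive did.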

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:57                               ` Jens Axboe
@ 2009-03-25 20:41                                 ` Hugh Dickins
  2009-03-26  8:57                                   ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Hugh Dickins @ 2009-03-25 20:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, 25 Mar 2009, Jens Axboe wrote:
> On Wed, Mar 25 2009, Ric Wheeler wrote:
> > Jens Axboe wrote:
> >>
> >> Another problem is that FLUSH_CACHE sucks. Really. And not just on
> >> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
> >> wait for the world to finish. Pretty hard to teach people to use a nicer
> >> fdatasync(), when the majority of the cost now becomes flushing the
> >> cache of that 1TB drive you happen to have 8 partitions on. Good luck
> >> with that.
> >>   
> > And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE  
> > is per device (not file system).
> >
> > When you issue an fsync() on a disk with multiple partitions, you will  
> > flush the data for all of its partitions from the write cache....
> 
> Exactly, that's what my (vague) 8 partition reference was for :-)
> A range flush would be so much more palatable.

Tangential question, but am I right in thinking that BIO_RW_BARRIER
similarly bars across all partitions, whereas its WRITE_BARRIER and
DISCARD_BARRIER users would actually prefer it to apply to just one?

Hugh

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:58                     ` Theodore Tso
  2009-03-25 19:48                       ` Christoph Hellwig
@ 2009-03-25 20:45                       ` Linus Torvalds
  2009-03-25 21:51                         ` Theodore Tso
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 20:45 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Theodore Tso wrote:
> 
> Now, there are three ways of solving this problem.

You seem to disregard the "write in the right order" approach. Or is that 
your:

> The third potential solution we can try doing is to make some tuning
> adjustments to the VM so that we start pushing out these data blocks
> much more aggressively out to the disk.

Yes, but at least one problem is, as mentioned, that when the VM calls 
writepage[s]() to start async writeback, many filesystems do seem to just 
_block_ on it.

So the VM has a really hard time doing anything sanely early - the 
filesystems seem to take a perverse pleasure in synchronizing things using 
blocking semaphores.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:40                               ` Linus Torvalds
@ 2009-03-25 20:57                                 ` Ric Wheeler
  2009-03-25 23:02                                   ` Linus Torvalds
  2009-03-25 21:29                                 ` [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29) Jeff Garzik
                                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-25 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Jeff Garzik wrote:
>   
>> It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
>> issued, without adding full barrier support to a filesystem.  It is likely
>> doable to avoid touching per-filesystem code at all, if we issue the flush
>> from a generic fsync(2) code path in the kernel.
>>     
>
> We could easily do that. It would even work for most cases. The 
> problematic ones are where filesystems do their own disk management, but I 
> guess those people can do their own fsync() management too.
>   

One concern with doing this above the file system is that you are not in 
the context of a transaction so you have no clean promises about what is 
on disk and persistent when. Flushing the cache is primitive at best, 
but the way barriers work today is designed to give the transactions 
some pretty critical ordering semantics for journalling file systems at 
least.

I don't see how you could use this approach to make a really robust, 
failure proof storage system, but it might appear to work most of the 
time for most people :-)

ric

> Somebody send me the patch, we can try it out.
>
>   
>> Remember, fsync(2) means that the user _expects_ a performance hit.
>>     
>
> Within reason, though.
>
> OS X, for example, doesn't do the disk barrier. It requires you to do a 
> separate FULL_FSYNC (or something similar) ioctl to get that. Apparently 
> exactly because users don't expect quite _that_ big of a performance hit.
>
> (Or maybe just because it was easier to do that way. Never attribute to 
> malice what can be sufficiently explained by stupidity).
>
> 			Linus
>
>   



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:25                                 ` Ric Wheeler
@ 2009-03-25 21:22                                   ` James Bottomley
  2009-03-26  8:59                                     ` Jens Axboe
  2009-03-30 19:05                                     ` range-based cache flushing (was Re: Linux 2.6.29) Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: James Bottomley @ 2009-03-25 21:22 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jeff Garzik, Jens Axboe, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, 2009-03-25 at 16:25 -0400, Ric Wheeler wrote:
> Jeff Garzik wrote:
> > Ric Wheeler wrote:
> >> And, as I am sure that you do know, to add insult to injury,
> >> FLUSH_CACHE is per device (not file system).
> >>
> >> When you issue an fsync() on a disk with multiple partitions, you 
> >> will flush the data for all of its partitions from the write cache....
> >
> > SCSI's SYNCHRONIZE CACHE command already accepts an (LBA, length) 
> > pair.  We could make use of that.
> >
> > And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
> > demonstrate clear benefit.
> >
> >     Jeff
> 
> How well supported is this in SCSI?  Can we try it out with a commodity 
> SAS drive?

What do you mean by well supported?  The way the SCSI standard is
written, a device can do a complete cache flush when a range flush is
requested and still be fully standards compliant.  There's no easy way
to tell if it does a complete cache flush every time other than by
taking the firmware apart (or asking the manufacturer).

James



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:16                               ` Jeff Garzik
  2009-03-25 20:25                                 ` Ric Wheeler
@ 2009-03-25 21:27                                 ` Benny Halevy
  1 sibling, 0 replies; 664+ messages in thread
From: Benny Halevy @ 2009-03-25 21:27 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Mar. 25, 2009, 22:16 +0200, Jeff Garzik <jeff@garzik.org> wrote:
> Ric Wheeler wrote:
>> And, as I am sure that you do know, to add insult to injury,
>> FLUSH_CACHE is per device (not file system).
>>
>> When you issue an fsync() on a disk with multiple partitions, you will 
>> flush the data for all of its partitions from the write cache....
> 
> SCSI's SYNCHRONIZE CACHE command already accepts an (LBA, length) pair. 
>   We could make use of that.
> 
> And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
> demonstrate clear benefit.

One more example of a flexible, fine-grained flush (though quite far out)
is T10 OSDs, with which you can flush a byte range of a single object
(or a collection, a partition, or the whole device LUN).

Benny

> 
> 	Jeff
> 
> 
> 


^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-25 20:40                               ` Linus Torvalds
  2009-03-25 20:57                                 ` Ric Wheeler
@ 2009-03-25 21:29                                 ` Jeff Garzik
  2009-03-25 21:56                                   ` Eric Sandeen
                                                     ` (2 more replies)
  2009-03-25 21:33                                 ` Linux 2.6.29 Jeff Garzik
  2009-03-27  7:57                                 ` Jens Axboe
  3 siblings, 3 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-25 21:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 01:40:37PM -0700, Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Jeff Garzik wrote:
> > It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
> > issued, without adding full barrier support to a filesystem.  It is likely
> > doable to avoid touching per-filesystem code at all, if we issue the flush
> > from a generic fsync(2) code path in the kernel.
> 
> We could easily do that. It would even work for most cases. The 
> problematic ones are where filesystems do their own disk management, but I 
> guess those people can do their own fsync() management too.
> 
> Somebody send me the patch, we can try it out.

This is a simple step that would cover a lot of cases...  sync(2)
calls sync_blockdev(), and many filesystems do as well via the generic
filesystem helper file_fsync (fs/sync.c).

XFS code calls sync_blockdev() a "big hammer", so I hope my patch
follows with known practice.

Looking over every use of sync_blockdev(), its most frequent use is
through fsync(2), for the selected filesystems that use the generic
file_fsync helper.

Most callers of sync_blockdev() in the kernel do so infrequently,
when removing and invalidating volumes (MD) or storing the superblock
prior to release (put_super) in some filesystems.

Compile-tested only, of course :)  But it should work :)

My main concern is some hidden area that calls sync_blockdev() with
a high-enough frequency that the performance hit is bad.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..7b9f74a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -173,9 +173,14 @@ int sync_blockdev(struct block_device *bdev)
 {
 	int ret = 0;
 
-	if (bdev)
-		ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
-	return ret;
+	if (!bdev)
+		return 0;
+	
+	ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
+	if (ret)
+		return ret;
+	
+	return blkdev_issue_flush(bdev, NULL);
 }
 EXPORT_SYMBOL(sync_blockdev);
 

^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:40                               ` Linus Torvalds
  2009-03-25 20:57                                 ` Ric Wheeler
  2009-03-25 21:29                                 ` [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29) Jeff Garzik
@ 2009-03-25 21:33                                 ` Jeff Garzik
  2009-03-27  7:57                                 ` Jens Axboe
  3 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-25 21:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> OS X, for example, doesn't do the disk barrier. It requires you to do a 
> separate FULL_FSYNC (or something similar) ioctl to get that. Apparently 
> exactly because users don't expect quite _that_ big of a performance hit.

I can understand that, more from an admin standpoint than anything... 
ATA disks' FLUSH CACHE is horribly coarse-grained, all-or-nothing.

SCSI's SYNCHRONIZE CACHE at least gives us an optional (LBA, length) 
pair that can be used to avoid flushing everything in the cache.
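
A minimal sketch of what such a ranged request looks like, assuming the
10-byte SYNCHRONIZE CACHE CDB layout from SBC (opcode 0x35, big-endian LBA in
bytes 2-5, block count in bytes 7-8). The helper name is invented for
illustration and nothing here is taken from the kernel's SCSI code. As noted
elsewhere in the thread, a device may still flush its whole cache in response
and remain standards compliant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SYNCHRONIZE_CACHE_10	0x35	/* SBC operation code */
#define SYNC_CACHE_CDB_LEN	10

/*
 * Fill in a SYNCHRONIZE CACHE(10) CDB covering nblocks logical blocks
 * starting at lba.  Multi-byte fields are big-endian.  A block count
 * of zero asks for everything from lba to the end of the medium.
 * build_sync_cache_cdb() is a hypothetical helper, not a kernel API.
 */
static void build_sync_cache_cdb(uint8_t cdb[SYNC_CACHE_CDB_LEN],
				 uint32_t lba, uint16_t nblocks)
{
	assert(cdb != NULL);

	memset(cdb, 0, SYNC_CACHE_CDB_LEN);
	cdb[0] = SYNCHRONIZE_CACHE_10;
	/* byte 1 carries flag bits such as IMMED; left at zero here */
	cdb[2] = (uint8_t)(lba >> 24);
	cdb[3] = (uint8_t)(lba >> 16);
	cdb[4] = (uint8_t)(lba >> 8);
	cdb[5] = (uint8_t)lba;
	/* byte 6 is the group number; left at zero here */
	cdb[7] = (uint8_t)(nblocks >> 8);
	cdb[8] = (uint8_t)nblocks;
	/* byte 9 is the control byte; left at zero here */
}
```

The 16-bit count limits a single command to 65535 blocks, so a larger range
would need several commands or the 16-byte variant.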

Microsoft has publicly proposed a WRITE BARRIER command for ATA, to try 
and improve the situation:
http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command_Proposal.doc

but that isn't in the field yet (if ever?)

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:48                       ` Christoph Hellwig
@ 2009-03-25 21:50                         ` Theodore Tso
  2009-03-26  2:10                           ` Matthew Garrett
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-25 21:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 03:48:51PM -0400, Christoph Hellwig wrote:
> On Wed, Mar 25, 2009 at 02:58:24PM -0400, Theodore Tso wrote:
> > omits the fsync().  So with ext4 we have workarounds that start pushing
> > out the data blocks in the replace-via-rename and
> > replace-via-truncate cases, while XFS will do an implied fsync for
> > replace-via-truncate only, and btrfs will do an implied fsync for
> > replace-via-rename only.
> 
> The XFS one and the ext4 one that I saw only start an _asynchronous_
> writeout.  Which is not an implied fsync but snake oil to make the
> most common complaints go away without providing hard guarantees.

It actually does the right thing for ext4, because once we allocate
the blocks, the default data=ordered mode means that we flush the
datablocks before we execute the commit.  Hence, in the case of
open/write/close/rename, the rename will trigger an async writeout,
but before the commit block is actually written, we'll have flushed
out the data blocks.

I was under the impression that XFS was doing a synchronous fsync
before allowing the close() to return, but if all it is doing is
triggering an async writeout, then yes, your concern is correct.  The
bigger problem from my perspective is that XFS is only doing this for
the truncate case, and (from what I've been told) not for the rename
case.  The truncate is fundamentally racy and application writers that
don't do this definitely don't deserve our solicitude, IMHO.  But
people who do open/write/close/rename, and omit the fsync before the
rename, are at least somewhat more deserving of some kind of
workaround than the idiots that do open/truncate/write/close.

> IFF we want to go down this route we should better provide strong
> guaranteed semantics and document the property.  And of course implement
> it consistently on all native filesystems.

That's something we should talk about at LSF.  I'm not all that eager
(or happy) about doing this, but I think that, given that the
application writers massively outnumber us, we are going to be bullied
into it.

> Note that the rename for atomic commits trick originated in mail servers
> which always did the proper fsync.  When the word spread into the
> desktop world it looks like this wisdom got lost.

Yep, agreed.
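
For anyone following along, the open/write/close/rename sequence with the
fsync() in place can be sketched roughly as below. This is an illustration
only: replace_file() and its arguments are invented names, EINTR handling is
left out, and it does not fsync() the containing directory, which a mail
server that cares about the rename itself being durable would also do.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Replace the file at path with len bytes from buf, so that a crash
 * leaves either the old contents or the complete new contents.
 * tmp_path must live on the same filesystem as path.
 * Returns 0 on success, -1 on error.
 * replace_file() is a hypothetical helper written for this example.
 */
static int replace_file(const char *path, const char *tmp_path,
			const void *buf, size_t len)
{
	const char *p = buf;
	int fd;

	assert(path && tmp_path && (buf || len == 0));

	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* write() may be short, so loop until everything is handed over */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0)
			goto fail;
		p += n;
		len -= (size_t)n;
	}

	/* The step that tends to get dropped: data first, then the rename. */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp_path);
		return -1;
	}

	/* rename() atomically switches the name over to the new file */
	if (rename(tmp_path, path) < 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp_path);
	return -1;
}
```

Dropping the fsync() call gives exactly the pattern discussed above, where
the filesystem has to guess that the data ought to reach the disk before the
rename does.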

To be fair, though, one problem which Matthew Garrett has pointed out
is that if lots of applications issue fsync(), it will have the
tendency to wake up the hard drive a lot, and do a real number on
power utilization.  I believe the right solution for this is an
extension to laptop mode which synchronizes the filesystem at a clean
point, and then which suppresses fsync()'s until the hard drive wakes
up, at which point it should flush all dirty data to the drive, and
then freezes writes to the disk again.  Presumably that should be OK,
because people who are using laptop mode are inherently trading off a certain
amount of safety for power savings; but then other people who want to
run a mysql server on a laptop get cranky, and then if we start
implementing ways that applications can exempt themselves from the
fsync() suppression, the complexity level starts rising.

This is a pretty complicated problem....  if people want to mount the
filesystem with the sync mount option, sure, but when people want
safety, speed, efficiency, power savings, *and* they want to use
crappy proprietary device drivers that crash if you look at them
funny, *and* be solicitous to application writers that rewrite
hundreds of files on desktop startup (even though it's not clear *why*
it is useful for KDE or GNOME to rewrite hundreds of files when the
user logs in and initializes the desktop), something has got to give.

There's nothing to trade off, other than the sanity of the file system
maintainers.  (But that's OK, Linus has called us crazy already.  :-/)

	      	   	      	    	       - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:45                       ` Linus Torvalds
@ 2009-03-25 21:51                         ` Theodore Tso
  2009-03-25 23:21                           ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-25 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 01:45:43PM -0700, Linus Torvalds wrote:
> > The third potential solution we can try doing is to make some tuning
> > adjustments to the VM so that we start pushing out these data blocks
> > much more aggressively out to the disk.
> 
> Yes, but at least one problem is, as mentioned, that when the VM calls 
> writepage[s]() to start async writeback, many filesystems do seem to just 
> _block_ on it.

Um, no, ext3 shouldn't block on writepage().  Since it doesn't do
delayed allocation, it should always be able to push out a dirty page
to the disk.

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-25 21:29                                 ` [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29) Jeff Garzik
@ 2009-03-25 21:56                                   ` Eric Sandeen
  2009-03-25 23:08                                     ` Jeff Garzik
                                                       ` (2 more replies)
  2009-03-25 22:01                                   ` Alan Cox
  2009-03-26  3:24                                   ` [PATCH v2] issue storage device flush via sync_blockdev() Jeff Garzik
  2 siblings, 3 replies; 664+ messages in thread
From: Eric Sandeen @ 2009-03-25 21:56 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Jeff Garzik wrote:
> On Wed, Mar 25, 2009 at 01:40:37PM -0700, Linus Torvalds wrote:
>> On Wed, 25 Mar 2009, Jeff Garzik wrote:
>>> It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
>>> issued, without adding full barrier support to a filesystem.  It is likely
>>> doable to avoid touching per-filesystem code at all, if we issue the flush
>>> from a generic fsync(2) code path in the kernel.
>> We could easily do that. It would even work for most cases. The 
>> problematic ones are where filesystems do their own disk management, but I 
>> guess those people can do their own fsync() management too.
>>
>> Somebody send me the patch, we can try it out.
> 
> This is a simple step that would cover a lot of cases...  sync(2)
> calls sync_blockdev(), and many filesystems do as well via the generic
> filesystem helper file_fsync (fs/sync.c).
> 
> XFS code calls sync_blockdev() a "big hammer", so I hope my patch
> follows with known practice.
> 
> Looking over every use of sync_blockdev(), its most frequent use is
> through fsync(2), for the selected filesystems that use the generic
> file_fsync helper.
> 
> Most callers of sync_blockdev() in the kernel do so infrequently,
> when removing and invalidating volumes (MD) or storing the superblock
> prior to release (put_super) in some filesystems.
> 
> Compile-tested only, of course :)  But it should work :)
> 
> My main concern is some hidden area that calls sync_blockdev() with
> a high-enough frequency that the performance hit is bad.
> 
> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 891e1c7..7b9f74a 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -173,9 +173,14 @@ int sync_blockdev(struct block_device *bdev)
>  {
>  	int ret = 0;
>  
> -	if (bdev)
> -		ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
> -	return ret;
> +	if (!bdev)
> +		return 0;
> +	
> +	ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
> +	if (ret)
> +		return ret;
> +	
> +	return blkdev_issue_flush(bdev, NULL);
>  }
>  EXPORT_SYMBOL(sync_blockdev);

What about when you're running over a big raid device with
battery-backed cache, and you trust the cache as much as the
disks?  Wouldn't this unconditional cache flush be painful there on any
of the callers even if they're rare?  (fs unmounts, freezes,
etc?  Or a fat filesystem on that device doing an fsync?)

xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
not enabled, I think for that reason...

(I'm assuming these raid devices still honor a cache flush request even
if they're battery-backed?  I dunno.)

-Eric

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-25 21:29                                 ` [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29) Jeff Garzik
  2009-03-25 21:56                                   ` Eric Sandeen
@ 2009-03-25 22:01                                   ` Alan Cox
  2009-03-25 23:12                                     ` Jeff Garzik
  2009-03-26  3:24                                   ` [PATCH v2] issue storage device flush via sync_blockdev() Jeff Garzik
  2 siblings, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-25 22:01 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> This is a simple step that would cover a lot of cases...  sync(2)
> calls sync_blockdev(), and many filesystems do as well via the generic
> filesystem helper file_fsync (fs/sync.c).

file_fsync probably needs to pass down more information so you can make
this a mount option. It's going to depend on the application whether the
flush is good, bad or indifferent.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 12:26                     ` Herbert Xu
@ 2009-03-25 22:01                       ` Ingo Molnar
  2009-03-25 22:20                         ` Ken Witherow
  2009-03-26  9:07                         ` Herbert Xu
  2009-03-25 22:54                       ` Jarek Poplawski
  1 sibling, 2 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-25 22:01 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel


* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Wed, Mar 25, 2009 at 01:20:46PM +0100, Ingo Molnar wrote:
> > 
> > ok - i have started testing the delta below, on top of the plain 
> > revert.

it's still fine btw, so:

Tested-by: Ingo Molnar <mingo@elte.hu>

> Thanks! BTW Ingo, any chance you could help us identify the 
> problem with the previous patch? I don't have a forcedeth machine 
> here and the hang you had with my patch that open-coded 
> __napi_complete appears intimately connected to forcedeth (with 
> NAPI enabled).
> 
> The simplest thing to try would be to build forcedeth.c with DEBUG
> and see what it prints out after it locks up.

Sure, can try that. Probably the best would be if you sent me a 
combo patch with the precise patch you meant me to try (there were 
several patches, i'm not sure which one is the 'previous' one) plus 
the forcedeth debug enable change as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 18:40           ` Linus Torvalds
@ 2009-03-25 22:05             ` Theodore Tso
  2009-03-25 23:23               ` Linus Torvalds
  2009-03-26  2:50               ` Neil Brown
  0 siblings, 2 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-25 22:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 11:40:28AM -0700, Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Theodore Tso wrote:
> > I'm beginning to think that using a "ratio" may be the wrong way to
> > go.  We probably need to add an optional dirty_max_megabytes field
> > where we start pushing dirty blocks out when the number of dirty
> > blocks exceeds either the dirty_ratio or the dirty_max_megabytes,
> > whichever comes first.
> 
> We have that. Except it's called "dirty_bytes" and 
> "dirty_background_bytes", and it defaults to zero (off).
> 
> The problem being that unlike the ratio, there's no sane default value 
> that you can at least argue is not _entirely_ pointless.

Well, if the maximum time that someone wants to wait for an fsync() to
return is one second, and the RAID array can write 100MB/sec, then
setting a value of 100MB makes a certain amount of sense.  Yes, this
doesn't take seek overheads into account, and it may be that we're not
writing things out in an optimal order, as Alan has pointed out.  But
100MB is a much lower number than 5% of 32GB (1.6GB).  It would be
better if these numbers were accounted on a per-filesystem basis
instead of a global threshold, but for people who are complaining about
huge latencies, it is at least a partial workaround that they can use
today.
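
The arithmetic behind those two numbers, as a small sketch. The function
names are made up for illustration, sizes are in decimal units to match the
figures above, and the 100MB/sec throughput is the example value from this
mail rather than a measurement.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Dirty-data limit implied by a percentage of memory, as with a
 * ratio-style tunable.  Integer division truncates slightly.
 * Hypothetical helper for illustration only.
 */
static uint64_t dirty_limit_from_ratio(uint64_t mem_bytes, unsigned int pct)
{
	assert(pct <= 100);
	return mem_bytes / 100 * pct;
}

/*
 * Dirty-data limit chosen so that writing it all out at the given
 * sequential throughput takes at most max_secs seconds.  Ignores seek
 * overhead and write ordering, as the text above admits.
 * Hypothetical helper for illustration only.
 */
static uint64_t dirty_limit_from_latency(uint64_t bytes_per_sec,
					 unsigned int max_secs)
{
	return bytes_per_sec * max_secs;
}
```

With 32GB of memory and a 5% ratio the first function gives 1.6GB of dirty
data, sixteen times the 100MB that the second gives for a one-second budget
at 100MB/sec. The smaller figure is the sort of value one would write to
dirty_bytes.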

I agree, it's not perfect, but this is a fundamentally hard problem.
We have multiple solutions, such as ext4 and XFS's delayed allocation,
which some people don't like because applications aren't calling
fsync().  We can boost the I/O priority of kjournald which definitely
helps, as Arjan has suggested, but Andrew has vetoed that.  I have a
patch which hopefully is less controversial, that posts writes using
WRITE_SYNC instead of WRITE, but which only will help in some
circumstances, but not in the distcc/icecream/fast downloads
scenarios.  We can use data=writeback, but folks don't like the
security implications of that.

People can call file system developers idiots if it makes them feel
better --- sure, OK, we all suck.  If someone wants to try to create a
better file system, show us how to do better, or send us some patches.
But this is not a problem that's easy to solve in a way that's going
to make everyone happy; else it would have been solved already.

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 22:01                       ` Ingo Molnar
@ 2009-03-25 22:20                         ` Ken Witherow
  2009-03-26  9:07                         ` Herbert Xu
  1 sibling, 0 replies; 664+ messages in thread
From: Ken Witherow @ 2009-03-25 22:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Herbert Xu, David Miller, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

On Wed, 25 Mar 2009, Ingo Molnar wrote:

>
> * Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
>> On Wed, Mar 25, 2009 at 01:20:46PM +0100, Ingo Molnar wrote:
>>>
>>> ok - i have started testing the delta below, on top of the plain
>>> revert.
>
> it's still fine btw, so:
>
> Tested-by: Ingo Molnar <mingo@elte.hu>

I saw your patch this morning and added it to my system too. 4 hours and 
15 minutes and everything is still fine here.



CONFIG_FORCEDETH=y
# CONFIG_FORCEDETH_NAPI is not set

with

00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 12:26                     ` Herbert Xu
  2009-03-25 22:01                       ` Ingo Molnar
@ 2009-03-25 22:54                       ` Jarek Poplawski
  2009-03-26  0:03                         ` David Miller
                                           ` (2 more replies)
  1 sibling, 3 replies; 664+ messages in thread
From: Jarek Poplawski @ 2009-03-25 22:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ingo Molnar, David Miller, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

Herbert Xu wrote, On 03/25/2009 01:26 PM:

> On Wed, Mar 25, 2009 at 01:20:46PM +0100, Ingo Molnar wrote:
>> ok - i have started testing the delta below, on top of the plain 
>> revert.
> 
> Thanks! BTW Ingo, any chance you could help us identify the problem
> with the previous patch? I don't have a forcedeth machine here
> and the hang you had with my patch that open-coded __napi_complete
> appears intimately connected to forcedeth (with NAPI enabled).

Of course it's too late for verifying this now, but (for the future)
I think, this scenario could be considered:

process_backlog()			netif_rx()

if (!skb)
local_irq_enable()
					if (queue.qlen) //NO
					napi_schedule() //NOTHING
					__skb_queue_tail() //qlen > 0
napi_complete()
...					...
					Every next netif_rx() sees
					qlen > 0, so napi is never
					scheduled again.

Then, something like this might work...

Jarek P.
--- (2.6.29)
 net/core/dev.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..cf53c24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2589,7 +2589,11 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
 			local_irq_enable();
-			napi_complete(napi);
+			napi_gro_flush(napi);
+			local_irq_disable();
+			if (skb_queue_empty(&queue->input_pkt_queue))
+				__napi_complete(napi);
+			local_irq_enable();
 			goto out;
 		}
 		local_irq_enable();

^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:57                                 ` Ric Wheeler
@ 2009-03-25 23:02                                   ` Linus Torvalds
  2009-03-26  0:28                                     ` Ric Wheeler
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 23:02 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jeff Garzik, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Ric Wheeler wrote:
> 
> One concern with doing this above the file system is that you are not in the
> context of a transaction so you have no clean promises about what is on disk
> and persistent when. Flushing the cache is primitive at best, but the way
> barriers work today is designed to give the transactions some pretty critical
> ordering semantics for journalling file systems at least.
> 
> I don't see how you could use this approach to make a really robust, failure
> proof storage system, but it might appear to work most of the time for most
> people :-)

You just do a write barrier after doing all the filesystem writing, and 
you return with the guarantee that all the writes the filesystem did are 
actually on disk.

No gray areas. No questions. No "might appear to work". 

Sure, there might be other writes that got flushed _too_, but nobody 
cares. If you have a crash later on, that's always true - you don't get 
crashes at nice well-defined points.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-25 21:56                                   ` Eric Sandeen
@ 2009-03-25 23:08                                     ` Jeff Garzik
  2009-03-26  2:31                                       ` Eric Sandeen
  2009-03-26  0:58                                     ` Ric Wheeler
  2009-03-27  7:59                                     ` Jens Axboe
  2 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-25 23:08 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Eric Sandeen wrote:
> Jeff Garzik wrote:
>> On Wed, Mar 25, 2009 at 01:40:37PM -0700, Linus Torvalds wrote:
>>> On Wed, 25 Mar 2009, Jeff Garzik wrote:
>>>> It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
>>>> issued, without adding full barrier support to a filesystem.  It is likely
>>>> doable to avoid touching per-filesystem code at all, if we issue the flush
>>>> from a generic fsync(2) code path in the kernel.
>>> We could easily do that. It would even work for most cases. The 
>>> problematic ones are where filesystems do their own disk management, but I 
>>> guess those people can do their own fsync() management too.
>>>
>>> Somebody send me the patch, we can try it out.
>> This is a simple step that would cover a lot of cases...  sync(2)
>> calls sync_blockdev(), and many filesystems do as well via the generic
>> filesystem helper file_fsync (fs/sync.c).
>>
>> XFS code calls sync_blockdev() a "big hammer", so I hope my patch
>> follows with known practice.
>>
>> Looking over every use of sync_blockdev(), its most frequent use is
>> through fsync(2), for the selected filesystems that use the generic
>> file_fsync helper.
>>
>> Most callers of sync_blockdev() in the kernel do so infrequently,
>> when removing and invalidating volumes (MD) or storing the superblock
>> prior to release (put_super) in some filesystems.
>>
>> Compile-tested only, of course :)  But it should work :)
>>
>> My main concern is some hidden area that calls sync_blockdev() with
>> a high-enough frequency that the performance hit is bad.
>>
>> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 891e1c7..7b9f74a 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -173,9 +173,14 @@ int sync_blockdev(struct block_device *bdev)
>>  {
>>  	int ret = 0;
>>  
>> -	if (bdev)
>> -		ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
>> -	return ret;
>> +	if (!bdev)
>> +		return 0;
>> +	
>> +	ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
>> +	if (ret)
>> +		return ret;
>> +	
>> +	return blkdev_issue_flush(bdev, NULL);
>>  }
>>  EXPORT_SYMBOL(sync_blockdev);
> 
> What about when you're running over a big raid device with
> battery-backed cache, and you trust the cache as much as the
> disks.  Wouldn't this unconditional cache flush be painful there on any
> of the callers even if they're rare?  (fs unmounts, freezes,
> etc?  Or a fat filesystem on that device doing an fsync?)

What exactly do you think sync_blockdev() does?  :)

It is used right before a volume goes away.  If that's not a time to 
flush the cache, I dunno what is.

The _whole purpose_ of sync_blockdev() is to push out the data to 
permanent storage.  Look at the users -- unmount volume, journal close, 
etc.  Things that are OK to occur after those points include: power off, 
device unplug, etc.

A secondary purpose of sync_blockdev() is as a hack, for simple/ancient 
bdev-based filesystems that do not wish to bother with barriers and all 
the associated complexity of tracking which writes do/do not need flushing.


> xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
> not enabled, I think for that reason...

Enabling barriers causes slowdowns far greater than that of simply 
causing fsync(2) to trigger FLUSH CACHE, because barriers imply FLUSH 
CACHE issuance for all in-kernel filesystem journalled/atomic 
transactions, in addition to whatever syscalls userspace is issuing.

The number of FLUSH CACHES w/ barriers is orders of magnitude larger 
than the number of fsync/fdatasync calls.

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-25 22:01                                   ` Alan Cox
@ 2009-03-25 23:12                                     ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-25 23:12 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Alan Cox wrote:
>> This is a simple step that would cover a lot of cases...  sync(2)
>> calls sync_blockdev(), and many filesystems do as well via the generic
>> filesystem helper file_fsync (fs/sync.c).
> 
> file_fsync probably needs to pass down more information so you can make
> this a mount option. It's going to depend on the application whether the
> flush is good, bad or indifferent.

file_fsync is only used by ancient legacy filesystems, which specifically 
don't want to bother with anything more complicated: HFS, HFS+, ADFS, 
AFFS, FAT, bfs, UFS, NTFS, qnx4.

IOW they _already_ consciously implement fsync(2) as "flush ENTIRE 
blockdev".

I think it is worth it to simply wait and see if mount options are even 
wanted.

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 21:51                         ` Theodore Tso
@ 2009-03-25 23:21                           ` Linus Torvalds
  2009-03-25 23:50                             ` Jan Kara
  2009-03-25 23:57                             ` Linux 2.6.29 Linus Torvalds
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 23:21 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Theodore Tso wrote:
> 
> Um, no, ext3 shouldn't block on writepage().  Since it doesn't do
> delayed allocation, it should always be able to push out a dirty page
> to the disk.

Umm. Maybe I'm mis-reading something, but they seem to all synchronize 
with the journal with "ext3_journal_start/stop".

Which will at a minimum wait for 'j_barrier_count == 0' and 't_state != 
T_LOCKED'. Along with making sure that there are enough transaction 
buffers.

Do I understand _why_ ext3 does that? Hell no. The code makes no sense to 
me. But I don't think I'm wrong.

Look at the sane case (data=ordered): it still does

	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
	...
	err = ext3_journal_stop(handle);

around all the IO starting. Never mind that the IO shouldn't be needing 
any journal activity at all afaik in any common case. 

Yes, yes, it may need to allocate backing store (a page that was dirtied 
by mmap), and I'm sure that's the reason for it all, but the point is, 
most of the time there should be no journal activity at all, yet it looks 
very much like a simple writepage() will synchronize with a full journal 
and wait for the journal to get space.

No?

So tell me again how the VM can rely on the filesystem not blocking at 
random points.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 22:05             ` Theodore Tso
@ 2009-03-25 23:23               ` Linus Torvalds
  2009-03-25 23:46                 ` Bron Gondwana
  2009-03-27  0:11                 ` Andrew Morton
  2009-03-26  2:50               ` Neil Brown
  1 sibling, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 23:23 UTC (permalink / raw)
  To: Theodore Tso; +Cc: David Rees, Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Theodore Tso wrote:
> > 
> > The problem being that unlike the ratio, there's no sane default value 
> > that you can at least argue is not _entirely_ pointless.
> 
> Well, if the maximum time that someone wants to wait for an fsync() to
> return is one second, and the RAID array can write 100MB/sec

How are you going to tell the kernel that the RAID array can write 
100MB/s?

The kernel has no idea.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 23:23               ` Linus Torvalds
@ 2009-03-25 23:46                 ` Bron Gondwana
  2009-03-26  0:32                   ` Ric Wheeler
  2009-03-27  0:11                 ` Andrew Morton
  1 sibling, 1 reply; 664+ messages in thread
From: Bron Gondwana @ 2009-03-25 23:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 04:23:08PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 25 Mar 2009, Theodore Tso wrote:
> > > 
> > > The problem being that unlike the ratio, there's no sane default value 
> > > that you can at least argue is not _entirely_ pointless.
> > 
> > Well, if the maximum time that someone wants to wait for an fsync() to
> > return is one second, and the RAID array can write 100MB/sec
> 
> How are you going to tell the kernel that the RAID array can write 
> 100MB/s?
> 
> The kernel has no idea.

Not at boot up, but after it's been using the RAID array for a little
while it could...

Bron (... imagining a tunable "max_fsync_wait_target_centisecs = 100" 
      which caused the kernel to notice how long flushes were taking
      and tune its buffer sizes to be approximately right over time )
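
The feedback loop imagined above could look roughly like the sketch below. Everything here is an assumption made for illustration: `struct flush_tuner`, `flush_tuner_update()` and the "move halfway toward the estimate" rule are hypothetical and are not kernel interfaces. The idea is simply that the limit converges on target time multiplied by observed throughput, e.g. 1 second at 100MB/sec gives roughly 100MB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuner state; none of these names exist in the kernel. */
struct flush_tuner {
	uint64_t target_us;	/* how long a flush is allowed to take */
	uint64_t limit_bytes;	/* current cap on buffered dirty data */
	uint64_t min_bytes;	/* never tune below this */
	uint64_t max_bytes;	/* never tune above this */
};

/*
 * Feed in one observation: a flush of `flushed_bytes` took `elapsed_us`.
 * The limit moves halfway toward the amount that would have taken
 * exactly `target_us` at the observed throughput.
 */
void flush_tuner_update(struct flush_tuner *t, uint64_t flushed_bytes,
			uint64_t elapsed_us)
{
	double bytes_per_us, ideal, next;

	assert(t != NULL);
	if (flushed_bytes == 0 || elapsed_us == 0)
		return;		/* nothing measurable in this sample */

	bytes_per_us = (double)flushed_bytes / (double)elapsed_us;
	ideal = bytes_per_us * (double)t->target_us;
	next = ((double)t->limit_bytes + ideal) / 2.0;

	/* Clamp so one odd measurement cannot push the limit to an extreme. */
	if (next < (double)t->min_bytes)
		next = (double)t->min_bytes;
	if (next > (double)t->max_bytes)
		next = (double)t->max_bytes;

	t->limit_bytes = (uint64_t)next;
}
```

Averaging with the previous limit is just one simple damping choice; a real implementation would also have to cope with mixed workloads and several devices.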

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 23:21                           ` Linus Torvalds
@ 2009-03-25 23:50                             ` Jan Kara
  2009-03-26  0:04                               ` Linus Torvalds
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
  2009-03-25 23:57                             ` Linux 2.6.29 Linus Torvalds
  1 sibling, 2 replies; 664+ messages in thread
From: Jan Kara @ 2009-03-25 23:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed 25-03-09 16:21:56, Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Theodore Tso wrote:
> > 
> > Um, no, ext3 shouldn't block on writepage().  Since it doesn't do
> > delayed allocation, it should always be able to push out a dirty page
> > to the disk.
> 
> Umm. Maybe I'm mis-reading something, but they seem to all synchronize 
> with the journal with "ext3_journal_start/stop".
> 
> Which will at a minimum wait for 'j_barrier_count == 0' and 't_state != 
> T_LOCKED'. Along with making sure that there are enough transaction 
> buffers.
> 
> Do I understand _why_ ext3 does that? Hell no. The code makes no sense to 
> me. But I don't think I'm wrong.
> 
> Look at the sane case (data=ordered): it still does
> 
> 	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> 	...
> 	err = ext3_journal_stop(handle);
> 
> around all the IO starting. Never mind that the IO shouldn't be needing 
> any journal activity at all afaik in any common case. 
> 
> Yes, yes, it may need to allocate backing store (a page that was dirtied 
> by mmap), and I'm sure that's the reason for it all, but the point is, 
> most of the time there should be no journal activity at all, yet it looks 
> very much like a simple writepage() will synchronize with a full journal 
> and wait for the journal to get space.
> 
> No?
  Yes, you got it right. Furthermore in ordered mode we need to attach
buffers to the running transaction if they aren't there (but for checking
whether they are we need to pin the running transaction and we are
basically where we started.. damn). But maybe there's a way out of it.
We don't have to guarantee that data written via mmap is on disk when "the
transaction running when somebody decided to call writepage" commits (in
case no block allocation happens), so we could just submit those buffers
for IO and not attach them to the transaction...

> So tell me again how the VM can rely on the filesystem not blocking at 
> random points.
  I can write a patch to make writepage() in the non-"mmapped creation"
case non-blocking on journal. But I'll also have to find out whether it
really helps something. But it's probably worth trying...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 23:21                           ` Linus Torvalds
  2009-03-25 23:50                             ` Jan Kara
@ 2009-03-25 23:57                             ` Linus Torvalds
  2009-03-26  0:22                               ` Jan Kara
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-25 23:57 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Linus Torvalds wrote:
> 
> Yes, yes, it may need to allocate backing store (a page that was dirtied 
> by mmap), and I'm sure that's the reason for it all,

Hmm. Thinking about that, I'm not so sure. Shouldn't that backing store 
allocation happen when the page is actually dirtied on ext3?

I _suspect_ that goes back to the fact that ext3 is older than the 
"aops->set_page_dirty()" callback, and nobody taught ext3 to do the bmap's 
at dirty time, so now it does it at writeout time.

Anyway, there we are. Old filesystems do the wrong thing (block allocation 
while doing writeout because they don't do it when dirtying), and newer 
filesystems do the wrong thing (block allocations during writeout, because 
they want to do delayed allocation to do the inode dirtying after doing 
writeback).

And in either case, the VM is screwed, and can't ask for writeout, because 
it will be randomly throttled by the filesystem. So we do lots of async 
bdflush threads, which then causes IO ordering problems because now the 
writeout is all in random order.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 22:54                       ` Jarek Poplawski
@ 2009-03-26  0:03                         ` David Miller
  2009-03-26  0:10                         ` David Miller
  2009-03-26  2:41                         ` Herbert Xu
  2 siblings, 0 replies; 664+ messages in thread
From: David Miller @ 2009-03-26  0:03 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, mingo, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 25 Mar 2009 23:54:56 +0100

> Herbert Xu wrote, On 03/25/2009 01:26 PM:
> 
> > On Wed, Mar 25, 2009 at 01:20:46PM +0100, Ingo Molnar wrote:
> >> ok - i have started testing the delta below, on top of the plain 
> >> revert.
> > 
> > Thanks! BTW Ingo, any chance you could help us identify the problem
> > with the previous patch? I don't have a forcedeth machine here
> > and the hang you had with my patch that open-coded __napi_complete
> > appears intimately connected to forcedeth (with NAPI enabled).
> 
> Of course it's too late for verifying this now, but (for the future)
> I think, this scenario could be considered:
> 
> process_backlog()			netif_rx()
> 
> if (!skb)
> local_irq_enable()
> 					if (queue.qlen) //NO
> 					napi_schedule() //NOTHING
> 					__skb_queue_tail() //qlen > 0
> napi_complete()
> ...					...
> 					Every next netif_rx() sees
> 					qlen > 0, so napi is never
> 					scheduled again.
> 
> Then, something like this might work...

Excellent detective work, I would have never figured this one
out.

Herbert, can you take a good look at this and confirm Jarek's
findings?

Thanks!

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 23:50                             ` Jan Kara
@ 2009-03-26  0:04                               ` Linus Torvalds
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
  1 sibling, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26  0:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Thu, 26 Mar 2009, Jan Kara wrote:
>
>   I can write a patch to make writepage() in the non-"mmapped creation"
> case non-blocking on journal. But I'll also have to find out whether it
> really helps something. But it's probably worth trying...

Actually, it really should be easier to make a patch that just does the 
journal thing if ->set_page_dirty() is called, and buffers weren't already 
allocated.

Then ext3_[ordered|writeback]_writepage() _should_ just become something 
like

	if (test_opt(inode->i_sb, NOBH))
		return nobh_writepage(page, ext3_get_block, wbc);

	return block_write_full_page(page, ext3_get_block, wbc);

and that's it. The code would be simpler to understand to boot.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 22:54                       ` Jarek Poplawski
  2009-03-26  0:03                         ` David Miller
@ 2009-03-26  0:10                         ` David Miller
  2009-03-26  6:43                           ` Jarek Poplawski
  2009-03-26  2:41                         ` Herbert Xu
  2 siblings, 1 reply; 664+ messages in thread
From: David Miller @ 2009-03-26  0:10 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, mingo, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 25 Mar 2009 23:54:56 +0100

Ingo, in case it isn't completely obvious, it would be
wonderful if you could try Jarek's patch below with your
test case.

Thanks!

> Herbert Xu wrote, On 03/25/2009 01:26 PM:
> 
> > On Wed, Mar 25, 2009 at 01:20:46PM +0100, Ingo Molnar wrote:
> >> ok - i have started testing the delta below, on top of the plain 
> >> revert.
> > 
> > Thanks! BTW Ingo, any chance you could help us identify the problem
> > with the previous patch? I don't have a forcedeth machine here
> > and the hang you had with my patch that open-coded __napi_complete
> > appears intimately connected to forcedeth (with NAPI enabled).
> 
> Of course it's too late for verifying this now, but (for the future)
> I think, this scenario could be considered:
> 
> process_backlog()			netif_rx()
> 
> if (!skb)
> local_irq_enable()
> 					if (queue.qlen) //NO
> 					napi_schedule() //NOTHING
> 					__skb_queue_tail() //qlen > 0
> napi_complete()
> ...					...
> 					Every next netif_rx() sees
> 					qlen > 0, so napi is never
> 					scheduled again.
> 
> Then, something like this might work...
> 
> Jarek P.
> --- (2.6.29)
>  net/core/dev.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e3fe5c7..cf53c24 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2589,7 +2589,11 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  		skb = __skb_dequeue(&queue->input_pkt_queue);
>  		if (!skb) {
>  			local_irq_enable();
> -			napi_complete(napi);
> +			napi_gro_flush(napi);
> +			local_irq_disable();
> +			if (skb_queue_empty(&queue->input_pkt_queue))
> +				__napi_complete(napi);
> +			local_irq_enable();
>  			goto out;
>  		}
>  		local_irq_enable();
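
The interleaving sketched in the quoted scenario can be replayed with a small single-threaded model. This is only an illustration under stated assumptions: `struct backlog_model` and the `model_*` functions are hypothetical stand-ins for the backlog queue length and the NAPI "scheduled" bit, the late packets are assumed to arrive in the window where interrupts are enabled, and the GRO flush from the patch is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: queue length plus the NAPI "scheduled" flag. */
struct backlog_model {
	int qlen;
	bool scheduled;
};

/* Producer side: only an empty queue triggers scheduling. */
void model_netif_rx(struct backlog_model *b)
{
	if (b->qlen == 0)
		b->scheduled = true;	/* no effect if already scheduled */
	b->qlen++;
}

/*
 * Poll saw an empty queue; `late_packets` arrive before completion.
 * Completion then clears the flag without looking at the queue again.
 */
void model_complete_unchecked(struct backlog_model *b, int late_packets)
{
	for (int i = 0; i < late_packets; i++)
		model_netif_rx(b);
	b->scheduled = false;
}

/* Same window, but completion re-checks the queue before clearing. */
void model_complete_rechecked(struct backlog_model *b, int late_packets)
{
	for (int i = 0; i < late_packets; i++)
		model_netif_rx(b);
	if (b->qlen == 0)
		b->scheduled = false;
}

/* Packets are queued but nothing will ever poll them. */
bool model_is_stalled(const struct backlog_model *b)
{
	return b->qlen > 0 && !b->scheduled;
}
```

In the unchecked variant a single late packet leaves the queue non-empty with the flag cleared, and every later `model_netif_rx()` only appends; the re-checked variant keeps the flag set so the poll runs again.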

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 23:57                             ` Linux 2.6.29 Linus Torvalds
@ 2009-03-26  0:22                               ` Jan Kara
  2009-03-26  1:34                                 ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Jan Kara @ 2009-03-26  0:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed 25-03-09 16:57:21, Linus Torvalds wrote:
> 
> 
> On Wed, 25 Mar 2009, Linus Torvalds wrote:
> > 
> > Yes, yes, it may need to allocate backing store (a page that was dirtied 
> > by mmap), and I'm sure that's the reason for it all,
> 
> Hmm. Thinking about that, I'm not so sure. Shouldn't that backing store 
> allocation happen when the page is actually dirtied on ext3?
  We don't do it currently. We could do it (it would also solve the problem
that we currently silently discard the user's data when they reach their quota
or the filesystem gets ENOSPC) but there are problems with it as well:
 1) We have to writeout blocks full of zeros on allocation so that we don't
expose unallocated data => slight slowdown
 2) When blocksize < pagesize we must play nasty tricks for this to work
(think about i_size = 1024, set_page_dirty(), truncate(f, 8192),
writepage() -> uhuh, not enough space allocated)
 3) We'll do allocation in the order in which pages are dirtied. Generally,
I'd suspect this order to be less linear than the order in which writepages
submits IO, and thus it will result in greater fragmentation of the file.
  So it's not a clear win IMHO.
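
The arithmetic behind point 2) can be made concrete with a small helper. This is a sketch only: `blocks_in_page_below_isize()` is a hypothetical name, and the 4096-byte page and 1024-byte block used in the note after the code are assumptions chosen to match the i_size = 1024 example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many filesystem blocks of page `page_index`
 * lie below `i_size`, i.e. how many blocks an allocate-at-dirty-time
 * scheme would see a reason to back for that page.
 */
unsigned int blocks_in_page_below_isize(uint64_t i_size, uint64_t page_index,
					unsigned int page_size,
					unsigned int block_size)
{
	uint64_t page_start = page_index * page_size;
	uint64_t bytes;

	assert(block_size > 0 && block_size <= page_size);

	if (i_size <= page_start)
		return 0;	/* page lies entirely beyond end of file */

	bytes = i_size - page_start;
	if (bytes > page_size)
		bytes = page_size;

	/* Round up: a partially covered block still needs backing. */
	return (unsigned int)((bytes + block_size - 1) / block_size);
}
```

With those assumed sizes, page 0 needs 1 block while i_size is 1024 but 4 blocks once the file is extended to 8192, so an allocation made at dirty time would leave 3 blocks unbacked by the time writepage() runs.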

> I _suspect_ that goes back to the fact that ext3 is older than the 
> "aops->set_page_dirty()" callback, and nobody taught ext3 to do the bmap's 
> at dirty time, so now it does it at writeout time.
> 
> Anyway, there we are. Old filesystems do the wrong thing (block allocation 
> while doing writeout because they don't do it when dirtying), and newer 
> filesystems do the wrong thing (block allocations during writeout, because 
> they want to do delayed allocation to do the inode dirtying after doing 
> writeback).
> 
> And in either case, the VM is screwed, and can't ask for writeout, because 
> it will be randomly throttled by the filesystem. So we do lots of async 
> bdflush threads, which then causes IO ordering problems because now the 
> writeout is all in random order.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 23:02                                   ` Linus Torvalds
@ 2009-03-26  0:28                                     ` Ric Wheeler
  2009-03-26  1:36                                       ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-26  0:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Ric Wheeler wrote:
>   
>> One concern with doing this above the file system is that you are not in the
>> context of a transaction so you have no clean promises about what is on disk
>> and persistent when. Flushing the cache is primitive at best, but the way
>> barriers work today is designed to give the transactions some pretty critical
>> ordering semantics for journalling file systems at least.
>>
>> I don't see how you could use this approach to make a really robust, failure
>> proof storage system, but it might appear to work most of the time for most
>> people :-)
>>     
>
> You just do a write barrier after doing all the filesystem writing, and 
> you return with the guarantee that all the writes the filesystem did are 
> actually on disk.
>
>

In this case, you have not gained anything  - same number of barrier 
operations/cache flushes and looser semantics for the transactions?

> No gray areas. No questions. No "might appear to work". 
>
> Sure, there might be other writes that got flushed _too_, but nobody 
> cares. If you have a crash later on, that's always true - you don't get 
> crashes at nice well-defined points.
>
> 			Linus
>

This is pretty much how write barriers work today - you carry down other 
transactions (even for other partitions on the same disk) with you...

ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 23:46                 ` Bron Gondwana
@ 2009-03-26  0:32                   ` Ric Wheeler
  0 siblings, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-26  0:32 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linus Torvalds, Theodore Tso, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Bron Gondwana wrote:
> On Wed, Mar 25, 2009 at 04:23:08PM -0700, Linus Torvalds wrote:
>   
>> On Wed, 25 Mar 2009, Theodore Tso wrote:
>>     
>>>> The problem being that unlike the ratio, there's no sane default value 
>>>> that you can at least argue is not _entirely_ pointless.
>>>>         
>>> Well, if the maximum time that someone wants to wait for an fsync() to
>>> return is one second, and the RAID array can write 100MB/sec
>>>       
>> How are you going to tell the kernel that the RAID array can write 
>> 100MB/s?
>>
>> The kernel has no idea.
>>     
>
> Not at boot up, but after it's been using the RAID array for a little
> while it could...
>
> Bron (... imagining a tunable "max_fsync_wait_target_centisecs = 100" 
>       which caused the kernel to notice how long flushes were taking
>       and tune its buffer sizes to be approximately right over time )
>

This tuning logic is the core of what Josef Bacik did for the 
transaction batching code for ext4....

ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-25 21:56                                   ` Eric Sandeen
  2009-03-25 23:08                                     ` Jeff Garzik
@ 2009-03-26  0:58                                     ` Ric Wheeler
  2009-03-26  1:26                                       ` Jeff Garzik
  2009-03-27  7:59                                     ` Jens Axboe
  2 siblings, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-26  0:58 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Eric Sandeen wrote:
> Jeff Garzik wrote:
>   
>> On Wed, Mar 25, 2009 at 01:40:37PM -0700, Linus Torvalds wrote:
>>     
>>> On Wed, 25 Mar 2009, Jeff Garzik wrote:
>>>       
>>>> It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
>>>> issued, without adding full barrier support to a filesystem.  It is likely
>>>> doable to avoid touching per-filesystem code at all, if we issue the flush
>>>> from a generic fsync(2) code path in the kernel.
>>>>         
>>> We could easily do that. It would even work for most cases. The 
>>> problematic ones are where filesystems do their own disk management, but I 
>>> guess those people can do their own fsync() management too.
>>>
>>> Somebody send me the patch, we can try it out.
>>>       
>> This is a simple step that would cover a lot of cases...  sync(2)
>> calls sync_blockdev(), and many filesystems do as well via the generic
>> filesystem helper file_fsync (fs/sync.c).
>>
>> XFS code calls sync_blockdev() a "big hammer", so I hope my patch
>> follows with known practice.
>>
>> Looking over every use of sync_blockdev(), its most frequent use is
>> through fsync(2), for the selected filesystems that use the generic
>> file_fsync helper.
>>
>> Most callers of sync_blockdev() in the kernel do so infrequently,
>> when removing and invalidating volumes (MD) or storing the superblock
>> prior to release (put_super) in some filesystems.
>>
>> Compile-tested only, of course :)  But it should work :)
>>
>> My main concern is some hidden area that calls sync_blockdev() with
>> a high-enough frequency that the performance hit is bad.
>>
>> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 891e1c7..7b9f74a 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -173,9 +173,14 @@ int sync_blockdev(struct block_device *bdev)
>>  {
>>  	int ret = 0;
>>  
>> -	if (bdev)
>> -		ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
>> -	return ret;
>> +	if (!bdev)
>> +		return 0;
>> +	
>> +	ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
>> +	if (ret)
>> +		return ret;
>> +	
>> +	return blkdev_issue_flush(bdev, NULL);
>>  }
>>  EXPORT_SYMBOL(sync_blockdev);
>>     
>
> What about when you're running over a big raid device with
> battery-backed cache, and you trust the cache as much as the
> disks.  Wouldn't this unconditional cache flush be painful there on any
> of the callers even if they're rare?  (fs unmounts, freezes,
> etc?  Or a fat filesystem on that device doing an fsync?)
>
> xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
> not enabled, I think for that reason...
>
> (I'm assuming these raid devices still honor a cache flush request even
> if they're battery-backed?  I dunno.)
>
> -Eric
>

I think that Jeff's patch misses the whole need to protect transactions, 
including metadata, in a precise way. It is useful for things like unmount, 
but not for giving us strong protection for transactions or for fsync().

This patch will be adding overhead here - you will still need flushing 
at the transaction commit layer of the specific file systems to get any 
reliable transactions.

Having looked at the timing of barrier flushes on slow S-ATA drives with 
an analyser a few years back, the first one was expensive (as you would 
expect with a large drive cache of 16 or 32 MB) and the second was 
nearly free.

Moving the expensive flush to this layer guts the transaction building 
blocks and costs the same....

ric




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-26  0:58                                     ` Ric Wheeler
@ 2009-03-26  1:26                                       ` Jeff Garzik
  2009-03-26  1:33                                         ` Jeff Garzik
  2009-03-26  8:24                                         ` Christoph Hellwig
  0 siblings, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-26  1:26 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric Sandeen, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List

Ric Wheeler wrote:
> Eric Sandeen wrote:
>> Jeff Garzik wrote:
>>  
>>> On Wed, Mar 25, 2009 at 01:40:37PM -0700, Linus Torvalds wrote:
>>>    
>>>> On Wed, 25 Mar 2009, Jeff Garzik wrote:
>>>>      
>>>>> It is clearly possible to implement an fsync(2) that causes FLUSH 
>>>>> CACHE to be
>>>>> issued, without adding full barrier support to a filesystem.  It is 
>>>>> likely
>>>>> doable to avoid touching per-filesystem code at all, if we issue 
>>>>> the flush
>>>>> from a generic fsync(2) code path in the kernel.
>>>>>         
>>>> We could easily do that. It would even work for most cases. The 
>>>> problematic ones are where filesystems do their own disk management, 
>>>> but I guess those people can do their own fsync() management too.
>>>>
>>>> Somebody send me the patch, we can try it out.
>>>>       
>>> This is a simple step that would cover a lot of cases...  sync(2)
>>> calls sync_blockdev(), and many filesystems do as well via the generic
>>> filesystem helper file_fsync (fs/sync.c).
>>>
>>> XFS code calls sync_blockdev() a "big hammer", so I hope my patch
>>> follows with known practice.
>>>
>>> Looking over every use of sync_blockdev(), its most frequent use is
>>> through fsync(2), for the selected filesystems that use the generic
>>> file_fsync helper.
>>>
>>> Most callers of sync_blockdev() in the kernel do so infrequently,
>>> when removing and invalidating volumes (MD) or storing the superblock
>>> prior to release (put_super) in some filesystems.
>>>
>>> Compile-tested only, of course :)  But it should work :)
>>>
>>> My main concern is some hidden area that calls sync_blockdev() with
>>> a high-enough frequency that the performance hit is bad.
>>>
>>> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
>>>
>>> diff --git a/fs/buffer.c b/fs/buffer.c
>>> index 891e1c7..7b9f74a 100644
>>> --- a/fs/buffer.c
>>> +++ b/fs/buffer.c
>>> @@ -173,9 +173,14 @@ int sync_blockdev(struct block_device *bdev)
>>>  {
>>>      int ret = 0;
>>>  
>>> -    if (bdev)
>>> -        ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
>>> -    return ret;
>>> +    if (!bdev)
>>> +        return 0;
>>> +   
>>> +    ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
>>> +    if (ret)
>>> +        return ret;
>>> +   
>>> +    return blkdev_issue_flush(bdev, NULL);
>>>  }
>>>  EXPORT_SYMBOL(sync_blockdev);
>>>     
>>
>> What about when you're running over a big raid device with
>> battery-backed cache, and you trust the cache as much as the
>> disks.  Wouldn't this unconditional cache flush be painful there on any
>> of the callers even if they're rare?  (fs unmounts, freezes,
>> etc?  Or a fat filesystem on that device doing an fsync?)
>>
>> xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
>> not enabled, I think for that reason...
>>
>> (I'm assuming these raid devices still honor a cache flush request even
>> if they're battery-backed?  I dunno.)
>>
>> -Eric
>>   
> I think that Jeff's patch misses the whole need to protect transactions,
> including metadata, in a precise way. It is useful for things like unmount,
> but not for giving us strong protection for transactions or for fsync().

What do you think sync_blockdev() does?  What is its purpose?

Twofold:
(1) guarantee all user data is flushed out before a major event 
(unmount, journal close, unplug, poweroff, explosion, ...)

(2) As a sledgehammer hack for simple or legacy filesystems that do not 
wish or need the complexity of transactional protection. 
sync_blockdev() is intentionally used in lieu of complexity for the 
following filesystems: HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, qnx4.

My patch adds needed guarantees, only for the above filesystems, where 
none were present before.


> This patch will be adding overhead here - you will still need flushing 
> at the transaction commit layer of the specific file systems to get any 
> reliable transactions.

sync_blockdev() is used as fsync(2) only in simple or legacy filesystems 
that do not want a transaction commit layer!

Read the patch :)

	Jeff




* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-26  1:26                                       ` Jeff Garzik
@ 2009-03-26  1:33                                         ` Jeff Garzik
  2009-03-26  1:39                                           ` Ric Wheeler
  2009-03-26  8:24                                         ` Christoph Hellwig
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-26  1:33 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric Sandeen, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List

Jeff Garzik wrote:
> Ric Wheeler wrote:
> What do you think sync_blockdev() does?  What is its purpose?

> Twofold:
> (1) guarantee all user data is flushed out before a major event 
> (unmount, journal close, unplug, poweroff, explosion, ...)

> (2) As a sledgehammer hack for simple or legacy filesystems that do not 
> wish or need the complexity of transactional protection. sync_blockdev() 
> is intentionally used in lieu of complexity for the following 
> filesystems: HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, qnx4.

> My patch adds needed guarantees, only for the above filesystems, where 
> none were present before.

To be specific, I was referring to fsync(2) guarantees being added to 
HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, and qnx4.

Other filesystems, besides those in the list, gain the flush-on-unmount 
action (a rare but useful addition) with my patch.

	Jeff





* Re: Linux 2.6.29
  2009-03-26  0:22                               ` Jan Kara
@ 2009-03-26  1:34                                 ` Linus Torvalds
  2009-03-26  2:59                                   ` Theodore Tso
  2009-03-26 16:24                                   ` Jan Kara
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26  1:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Thu, 26 Mar 2009, Jan Kara wrote:
>
>  1) We have to writeout blocks full of zeros on allocation so that we don't
> expose unallocated data => slight slowdown

Why? 

This is in _no_ way different from a regular "write()" system call. And 
there, we just attach the buffers to the page. If something crashes before 
the page actually gets written out, then we'll have hopefully never 
written out the metadata (that's what "data=ordered" means).

>  2) When blocksize < pagesize we must play nasty tricks for this to work
> (think about i_size = 1024, set_page_dirty(), truncate(f, 8192),
> writepage() -> uhuh, not enough space allocated)

Good point. I suspect not enough people have played around with 
"set_page_dirty()" to find these kinds of things. The VFS layer probably 
doesn't help sufficiently with the half-dirty pages, although the FS can 
obviously always look up the previously last page and do things manually 
if it wants to.

But yes, this is nasty.

>  3) We'll do allocation in the order in which pages are dirtied. Generally,
> I'd suspect this order to be less linear than the order in which writepages
> submit IO and thus it will result in the larger fragmentation of the file.
>   So it's not a clear win IMHO.

Yes, that may be the case.

Of course, the approach of just checking whether the buffer heads already 
exists and are mapped (before bothering with anything else) probably works 
fine in practice. In most loads, pages will have been dirtied by regular 
"write()" system calls, and then we will have the buffers pre-allocated 
regardless. 

			Linus


* Re: Linux 2.6.29
  2009-03-26  0:28                                     ` Ric Wheeler
@ 2009-03-26  1:36                                       ` Linus Torvalds
  0 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26  1:36 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jeff Garzik, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Ric Wheeler wrote:
>
> In this case, you have not gained anything  - same number of barrier
> operations/cache flushes and looser semantics for the transactions?

Um. Except you gained the fact that the filesystem doesn't have to care 
and screw it up. And then we can know that it gets done, regardless of 
what odd things the low-level fs does.

		Linus


* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-26  1:33                                         ` Jeff Garzik
@ 2009-03-26  1:39                                           ` Ric Wheeler
  0 siblings, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-26  1:39 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Eric Sandeen, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List

Jeff Garzik wrote:
> Jeff Garzik wrote:
>> Ric Wheeler wrote:
>> What do you think sync_blockdev() does?  What is its purpose?
>
>> Twofold:
>> (1) guarantee all user data is flushed out before a major event 
>> (unmount, journal close, unplug, poweroff, explosion, ...)
>
>> (2) As a sledgehammer hack for simple or legacy filesystems that do 
>> not wish or need the complexity of transactional protection. 
>> sync_blockdev() is intentionally used in lieu of complexity for the 
>> following filesystems: HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, qnx4.
>
>> My patch adds needed guarantees, only for the above filesystems, 
>> where none were present before.
>
> To be specific, I was referring to fsync(2) guarantees being added to 
> HFS, HFS+, ADFS, AFFS, FAT, bfs, UFS, NTFS, and qnx4.
>
> Other filesystems, besides those in the list, gain the 
> flush-on-unmount action (a rare but useful addition) with my patch.
>
>     Jeff
>
>
>
Sorry for misunderstanding the scope of this before - this is certainly 
a net win for the file systems that don't have proper barrier support 
baked in already. Thanks!

Ric




* Re: Linux 2.6.29
  2009-03-25 21:50                         ` Theodore Tso
@ 2009-03-26  2:10                           ` Matthew Garrett
  2009-03-26  2:36                             ` Jeff Garzik
       [not found]                             ` <f73f7ab80903251944s581166bbk31c26db50750814a@mail.gmail.com>
  0 siblings, 2 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26  2:10 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Linus Torvalds, Jan Kara,
	Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 05:50:16PM -0400, Theodore Tso wrote:

> To be fair, though, one problem which Matthew Garrett has pointed out
> is that if lots of applications issue fsync(), it will have the
> tendency to wake up the hard drive a lot, and do a real number on
> power utilization.  I believe the right solution for this is an
> extension to laptop mode which synchronizes the filesystem at a clean
> point, and then which suppresses fsync()'s until the hard drive wakes
> up, at which point it should flush all dirty data to the drive, and
> then freezes writes to the disk again.  Presumably that should be OK,
> because people who are using laptop mode are inherently trading off a
> certain amount of safety for power savings; but then other people who want to
> run a mysql server on a laptop get cranky, and then if we start
> implementing ways that applications can exempt themselves from the
> fsync() suppression, the complexity level starts rising.

I disagree with this approach. If fsync() means anything other than "Get 
my data on disk and then return" then we're breaking guarantees to 
applications. The problem is that you're insisting that the only way 
applications can ensure that their requests occur in order is to use 
fsync(), which will achieve that but also provides guarantees above and 
beyond what the majority of applications want.

I've done some benchmarking now and I'm actually fairly happy with the 
behaviour of ext4 now - it seems that the real world impact of doing the 
block allocation at rename time isn't that significant, and if that's 
the only practical way to ensure ordering guarantees in ext4 then fine. 
But given that, I don't think there's any reason to try to convince 
application authors to use fsync() more.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-25 23:08                                     ` Jeff Garzik
@ 2009-03-26  2:31                                       ` Eric Sandeen
  2009-03-26 14:19                                         ` Ric Wheeler
  0 siblings, 1 reply; 664+ messages in thread
From: Eric Sandeen @ 2009-03-26  2:31 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Jeff Garzik wrote:
> Eric Sandeen wrote:

>> What about when you're running over a big raid device with
>> battery-backed cache, and you trust the cache as much as the
>> disks.  Wouldn't this unconditional cache flush be painful there on any
>> of the callers even if they're rare?  (fs unmounts, freezes,
>> etc?  Or a fat filesystem on that device doing an fsync?)
> 
> What exactly do you think sync_blockdev() does?  :)

It used to push OS-cached data to the storage.  Now it tells the storage
to flush cache too (with your patch).  This seems fine in general,
although it's not a panacea for all the various data integrity issues
that are being tossed about in this thread.  :)

> It is used right before a volume goes away.  If that's not a time to 
> flush the cache, I dunno what is.
> 
> The _whole purpose_ of sync_blockdev() is to push out the data to 
> permanent storage.  Look at the users -- unmount volume, journal close, 
> etc.  Things that are OK to occur after those points include: power off, 
> device unplug, etc.

Sure.  But I was thinking about enterprise raids with battery backup
which may last for days.

But, ok, I wasn't thinking quite right about the unmount situations etc;
even on enterprise raids like this, flushing things out on unmount makes
sense in the case where you lose power post-unmount and can't restore
power before the battery backup dies.

I also wondered if a cache flush on one lun issues a cache flush for the
entire controller, or just for that lun.  Hopefully the latter, in which
case it's not as big a deal.

> A secondary purpose of sync_blockdev() is as a hack, for simple/ancient 
> bdev-based filesystems that do not wish to bother with barriers and all 
> the associated complexity of tracking what writes do/do not need flushing.

Yep, I get that and it seems reasonable except ....

>> xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
>> not enabled, I think for that reason...
> 
> Enabling barriers causes slowdowns far greater than that of simply 
> causing fsync(2) to trigger FLUSH CACHE, because barriers imply FLUSH 
> CACHE issuance for all in-kernel filesystem journalled/atomic 
> transactions, in addition to whatever syscalls userspace is issuing.
> 
> The number of FLUSH CACHES w/ barriers is orders of magnitude larger 
> than the number of fsync/fdatasync calls.

I understand all that.

My point is that the above filesystems (xfs, reiserfs, ext4) skip the
blkdev flush on fsync when barriers are explicitly disabled.

They do this because if an admin disables barriers, they are trusting
that the write cache is nonvolatile and will be able to destage fully
even if external power is lost for some time.

In that case you don't need a blkdev_issue_flush on fsync either (or are
at least willing to live with the diminished risk, thanks to the battery
backup), and on xfs, ext4 etc you can turn it off (it goes away w/ the
barriers off setting).  With this change to the simple generic fsync
path, you can't turn it off for those filesystems that use it for fsync.

But I suppose it's rare that anybody ever uses a filesystem which uses
this generic sync method on any sort of interesting storage like I'm
talking about, and it's not a big deal...  (or maybe that interesting
storage just ignores cache flushes anyway, I dunno).

My main concerns were that these extra cache flushes for fsync aren't
tunable, and that flushes on one lun might affect other luns.  I guess
I've talked myself out of those concerns in a couple different ways now.  ;)

-Eric


* Re: Linux 2.6.29
  2009-03-26  2:10                           ` Matthew Garrett
@ 2009-03-26  2:36                             ` Jeff Garzik
  2009-03-26  2:42                               ` Matthew Garrett
       [not found]                             ` <f73f7ab80903251944s581166bbk31c26db50750814a@mail.gmail.com>
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-26  2:36 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Theodore Tso, Christoph Hellwig, Linus Torvalds, Jan Kara,
	Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Matthew Garrett wrote:
> I disagree with this approach. If fsync() means anything other than "Get 
> my data on disk and then return" then we're breaking guarantees to 
> applications.

Due to lack of storage dev writeback cache flushing, we are indeed 
breaking that guarantee in many situations...


> The problem is that you're insisting that the only way 
> applications can ensure that their requests occur in order is to use 
> fsync(), which will achieve that but also provides guarantees above and 
> beyond what the majority of applications want.

That remains a true statement...   without the *sync* syscalls, you 
still do not have a _guarantee_ writes occur in a certain order.

	Jeff




* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 22:54                       ` Jarek Poplawski
  2009-03-26  0:03                         ` David Miller
  2009-03-26  0:10                         ` David Miller
@ 2009-03-26  2:41                         ` Herbert Xu
  2009-03-26  3:20                           ` David Miller
  2009-03-26  7:39                           ` Jarek Poplawski
  2 siblings, 2 replies; 664+ messages in thread
From: Herbert Xu @ 2009-03-26  2:41 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, David Miller, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

On Wed, Mar 25, 2009 at 11:54:56PM +0100, Jarek Poplawski wrote:
> 
> Of course it's too late for verifying this now, but (for the future)
> I think, this scenario could be considered:
> 
> process_backlog()			netif_rx()
> 
> if (!skb)
> local_irq_enable()
> 					if (queue.qlen) //NO
> 					napi_schedule() //NOTHING
> 					__skb_queue_tail() //qlen > 0
> napi_complete()
> ...					...
> 					Every next netif_rx() sees
> 					qlen > 0, so napi is never
> 					scheduled again.
> 
> Then, something like this might work...

Yes this is why my original patch that started all this is broken.
However, this doesn't apply to my patch that open-codes __napi_complete.
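The interleaving in Jarek's diagram can be reproduced in a toy, single-threaded userspace model. This is only a sketch, not kernel code: `struct backlog`, `model_netif_rx()`, `model_poll_done()` and `model_is_stuck()` are made-up names, and the `scheduled` flag merely stands in for NAPI_STATE_SCHED. It assumes netif_rx() only attempts to schedule the poll when the queue is empty, as described above.

```c
#include <stdbool.h>

/* Toy model of the per-cpu backlog; not the kernel's data structure. */
struct backlog {
    int  qlen;       /* packets waiting in the input queue */
    bool scheduled;  /* stands in for NAPI_STATE_SCHED */
    int  wakeups;    /* how many times a poll was actually queued */
};

/* Model of netif_rx(): only an empty queue triggers a schedule attempt,
 * and the attempt is a no-op while the SCHED bit is still set. */
static void model_netif_rx(struct backlog *b)
{
    if (b->qlen == 0 && !b->scheduled) {
        b->scheduled = true;
        b->wakeups++;
    }
    b->qlen++;
}

/*
 * Model of the tail of process_backlog() once the queue has drained.
 * irq_fires injects one netif_rx() into the window where interrupts
 * are enabled again. With atomic_complete the SCHED bit is cleared
 * before that window opens; without it, the bit is cleared afterwards.
 */
static void model_poll_done(struct backlog *b, bool atomic_complete,
                            bool irq_fires)
{
    if (atomic_complete)
        b->scheduled = false;   /* still inside the IRQ-off section */

    if (irq_fires)
        model_netif_rx(b);      /* interrupt lands after irqs are enabled */

    if (!atomic_complete)
        b->scheduled = false;   /* completion runs too late */
}

/* True if packets are queued but nothing will ever poll them. */
static bool model_is_stuck(const struct backlog *b)
{
    return b->qlen > 0 && !b->scheduled;
}
```

Calling `model_poll_done()` with `atomic_complete` false and `irq_fires` true leaves the model stuck, and further `model_netif_rx()` calls never recover it; with `atomic_complete` true the injected packet reschedules the poll.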

> Jarek P.
> --- (2.6.29)
>  net/core/dev.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e3fe5c7..cf53c24 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2589,7 +2589,11 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  		skb = __skb_dequeue(&queue->input_pkt_queue);
>  		if (!skb) {
>  			local_irq_enable();
> -			napi_complete(napi);
> +			napi_gro_flush(napi);
> +			local_irq_disable();
> +			if (skb_queue_empty(&queue->input_pkt_queue))
> +				__napi_complete(napi);
> +			local_irq_enable();

This should work too.  However, the fact that the following patch
is broken means that we have bigger problems.

net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete.  Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.

This patch fixes this by essentially open-coding __napi_complete.

Note we no longer need the memory barrier because this function
is per-cpu.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..2a7f6b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			list_del(&napi->poll_list);
+			clear_bit(NAPI_STATE_SCHED, &napi->state);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
@@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 
 	napi_gro_flush(napi);
 
-out:
 	return work;
 }

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Linux 2.6.29
  2009-03-26  2:36                             ` Jeff Garzik
@ 2009-03-26  2:42                               ` Matthew Garrett
  0 siblings, 0 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26  2:42 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Christoph Hellwig, Linus Torvalds, Jan Kara,
	Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 10:36:31PM -0400, Jeff Garzik wrote:
> Matthew Garrett wrote:
> >The problem is that you're insisting that the only way 
> >applications can ensure that their requests occur in order is to use 
> >fsync(), which will achieve that but also provides guarantees above and 
> >beyond what the majority of applications want.
> 
> That remains a true statement...   without the *sync* syscalls, you 
> still do not have a _guarantee_ writes occur in a certain order.

The interesting case is whether data hits disk before metadata when 
renaming over the top of an existing file, which appears to be 
guaranteed in the default ext4 configuration now? I'm sure there are 
filesystems where this isn't the case, but that's mostly just an 
argument that it's not sensible to use those filesystems if your 
system's at any risk of crashing.
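For reference, the rename-over pattern under discussion looks roughly like this from userspace. It is a minimal sketch: `replace_file()` and its parameters are invented for illustration, and the `do_fsync` switch only exists to show the two variants being argued about (relying on the filesystem to order data before the rename, versus forcing it with fsync()).

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `data` by writing a temporary
 * file and renaming it over the target. If do_fsync is nonzero, the
 * data is forced to storage before the rename; otherwise the ordering
 * of data versus metadata is left to the filesystem.
 * Returns 0 on success, -1 on failure.
 */
static int replace_file(const char *path, const char *tmp_path,
                        const void *data, size_t len, int do_fsync)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be short, so loop until everything is handed over. */
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (do_fsync && fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* The rename is the commit point: readers see old or new contents. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

With `do_fsync` set to 0 this is what many existing applications do today, which is why the ordering behaviour of rename matters so much.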

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: Linux 2.6.29
       [not found]                             ` <f73f7ab80903251944s581166bbk31c26db50750814a@mail.gmail.com>
@ 2009-03-26  2:46                               ` Kyle Moffett
  2009-03-26  2:51                                 ` Jeff Garzik
  2009-03-26  2:47                               ` Matthew Garrett
  1 sibling, 1 reply; 664+ messages in thread
From: Kyle Moffett @ 2009-03-26  2:46 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Theodore Tso, Christoph Hellwig, Linus Torvalds, Jan Kara,
	Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Apologies for the HTML email, resent in ASCII below:

> On Wed, Mar 25, 2009 at 10:10 PM, Matthew Garrett <mjg59@srcf.ucam.org> wrote:
>>
>> If fsync() means anything other than "Get
>> my data on disk and then return" then we're breaking guarantees to
>> applications. The problem is that you're insisting that the only way
>> applications can ensure that their requests occur in order is to use
>> fsync(), which will achieve that but also provides guarantees above and
>> beyond what the majority of applications want.
>>
>> I've done some benchmarking now and I'm actually fairly happy with the
>> behaviour of ext4 now - it seems that the real world impact of doing the
>> block allocation at rename time isn't that significant, and if that's
>> the only practical way to ensure ordering guarantees in ext4 then fine.
>> But given that, I don't think there's any reason to try to convince
>> application authors to use fsync() more.
>
> Really, the problem is the filesystem interfaces are incomplete.  There are plenty of ways to specify a "FLUSH CACHE"-type command for an individual file or for the whole filesystem, but there aren't really any ways for programs to specify barriers (either whole-blockdev or per-LBA-range).  An fsync() implies you want to *wait* for the data... there's no way to ask it all to be queued with some ordering constraints.
> Perhaps we ought to add a couple extra open flags, O_BARRIER_BEFORE and O_BARRIER_AFTER, and rename3(), etc functions that take flags arguments?
> Or maybe a new set of syscalls like barrier(file1, file2) and fbarrier(fd1, fd2), which cause all pending changes (perhaps limit to this process?) to the file at fd1 to occur before any successive changes (again limited to this process?) to the file at fd2.
> It seems that rename(oldfile, newfile) with an already-existing newfile should automatically imply barrier(oldfile, newfile) before it occurs, simply because so many programs rely on that.
> In the cross-filesystem case, the fbarrier() might simply fsync(fd1), since that would provide the equivalent guarantee, albeit with possibly significant performance penalties.  I can't think of any easy way to prevent one filesystem from syncing writes to a particular file until another filesystem has finished an asynchronous fsync() call.  Perhaps a half-way solution would be to asynchronously fsync(fd1) and simply block the next write()/ioctl()/etc on fd2 until the async fsync returns.
> Are there other ideas for useful barrier()-generating file APIs?
> Cheers,
> Kyle Moffett
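
The cross-filesystem fallback described above could be sketched in userspace as follows. To be clear, fbarrier() does not exist; `fbarrier_fallback()` is a hypothetical name, and this version gets the ordering guarantee the blunt way, by waiting for fd1 to be synced before the caller goes on to touch fd2.

```c
#include <unistd.h>

/*
 * Hypothetical stand-in for the proposed fbarrier(fd1, fd2).
 * It guarantees that changes already made to fd1 reach storage before
 * any later change to fd2 simply by blocking in fsync(fd1). Nothing is
 * done to fd2: the caller has not issued the later writes yet.
 * Returns 0 on success, -1 on failure (as fsync() does).
 */
static int fbarrier_fallback(int fd1, int fd2)
{
    (void)fd2;
    return fsync(fd1);
}
```

This gives the ordering but none of the performance benefit, which is the whole point of wanting a real barrier primitive.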


* Re: Linux 2.6.29
       [not found]                             ` <f73f7ab80903251944s581166bbk31c26db50750814a@mail.gmail.com>
  2009-03-26  2:46                               ` Kyle Moffett
@ 2009-03-26  2:47                               ` Matthew Garrett
  2009-03-26  2:54                                 ` Kyle Moffett
  1 sibling, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26  2:47 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Theodore Tso, Christoph Hellwig, Linus Torvalds, Jan Kara,
	Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 10:44:44PM -0400, Kyle Moffett wrote:

>    Perhaps we ought to add a couple extra open flags, O_BARRIER_BEFORE and
>    O_BARRIER_AFTER, and rename3(), etc functions that take flags arguments?
>    Or maybe a new set of syscalls like barrier(file1, file2) and
>    fbarrier(fd1, fd2), which cause all pending changes (perhaps limit to this
>    process?) to the file at fd1 to occur before any successive changes (again
>    limited to this process?) to the file at fd2.

That's an option, but what would benefit? If rename is expected to 
preserve ordering (which I think it has to, in order to avoid breaking 
existing code) then are there any other interesting use cases?

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: Linux 2.6.29
  2009-03-25 22:05             ` Theodore Tso
  2009-03-25 23:23               ` Linus Torvalds
@ 2009-03-26  2:50               ` Neil Brown
  2009-03-26  3:13                 ` Theodore Tso
  1 sibling, 1 reply; 664+ messages in thread
From: Neil Brown @ 2009-03-26  2:50 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wednesday March 25, tytso@mit.edu wrote:
> On Wed, Mar 25, 2009 at 11:40:28AM -0700, Linus Torvalds wrote:
> > On Wed, 25 Mar 2009, Theodore Tso wrote:
> > > I'm beginning to think that using a "ratio" may be the wrong way to
> > > go.  We probably need to add an optional dirty_max_megabytes field
> > > where we start pushing dirty blocks out when the number of dirty
> > > blocks exceeds either the dirty_ratio or the dirty_max_megabytes,
> > > which ever comes first.
> > 
> > We have that. Except it's called "dirty_bytes" and 
> > "dirty_background_bytes", and it defaults to zero (off).
> > 
> > The problem being that unlike the ratio, there's no sane default value 
> > that you can at least argue is not _entirely_ pointless.
> 
> Well, if the maximum time that someone wants to wait for an fsync() to
> return is one second, and the RAID array can write 100MB/sec, then
> setting a value of 100MB makes a certain amount of sense.  Yes, this
> doesn't take seek overheads into account, and it may be that we're not
> writing things out in an optimal order, as Alan has pointed out.  But
> 100MB is a much lower number than 5% of 32GB (1.6GB).  It would be
> better if these numbers were accounted on a per-filesystem basis instead
> of as a global threshold, but for people who are complaining about huge
> latencies, it is at least a partial workaround that they can use today.

We do a lot of dirty accounting on a per-backing_device basis.  This
was added to stop slow devices from sucking up too much of the "40%
dirty" space.  The allowable dirty space is now shared among all
devices in rough proportion to how quickly they write data out.

My memory of how it works isn't perfect, but we count write-out
completions both globally and per-bdi and maintain a fraction:
      my-writeout-completions
    --------------------------
    total-writeout-completions

That device then gets a share of the available dirty space based on
the fraction.

The counts decay somehow so that the fraction represents recent
activity.
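
As a rough illustration of that fraction (not the actual kernel code, which I am describing from memory anyway), the share could be computed like this. `bdi_dirty_share()` and `decay_count()` are made-up names, and halving is just one assumed way for the counts to decay.

```c
#include <stdint.h>

/*
 * Share of the global dirty allowance for one device:
 *   limit * my_completions / total_completions
 * computed as quotient and remainder parts so the large product
 * limit * my_completions is never formed.
 */
static uint64_t bdi_dirty_share(uint64_t global_dirty_limit,
                                uint64_t my_completions,
                                uint64_t total_completions)
{
    if (total_completions == 0)
        return 0;   /* no history yet; a real policy would want a floor */

    uint64_t q = global_dirty_limit / total_completions;
    uint64_t r = global_dirty_limit % total_completions;

    return q * my_completions + r * my_completions / total_completions;
}

/* Assumed decay step: halve a counter so old activity counts for less. */
static uint64_t decay_count(uint64_t count)
{
    return count / 2;
}
```

For example, a device responsible for 3 of the last 4 completions would get 750 out of a global allowance of 1000 pages.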

It shouldn't be too hard to add some concept of total time to this.
If we track the number of write-outs per unit time and use that together
with a "target time for fsync" to scale the 'dirty_bytes' number, we
might be able to auto-tune the amount of dirty space to fit the speeds
of the drives.

We would probably start with each device having a very low "max dirty"
number which would cause writeouts to start soon.  Once the device
demonstrates that it can do n-per-second (or whatever) the VM would
allow the "max dirty" number to drift upwards.  I'm not sure how best
to get it to move downwards if the device slows down (or the kernel
over-estimated).  Maybe it should regularly decay so that the device
keeps having to "prove" itself.

We would still leave the "dirty_ratio" as an upper-limit because we
don't want all of memory to be dirty (and 40% still sounds about
right).  But we would now have a time-based value to set a more
realistic limit when there is enough memory to keep the devices busy
for multiple minutes.
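
To put numbers on it, here is a small sketch of the combined limit, using the figures from Ted's mail (32GB of memory, a 5% ratio, 100MB/sec, a one second target). `dirty_limit_bytes()` is a hypothetical helper, the measured write rate is simply passed in, and decimal units are assumed.

```c
#include <stdint.h>

/*
 * Dirty limit as the smaller of:
 *   - the ratio-based ceiling (a percentage of memory), and
 *   - what the device can write out within the target time.
 */
static uint64_t dirty_limit_bytes(uint64_t mem_bytes,
                                  unsigned ratio_percent,
                                  uint64_t write_bytes_per_sec,
                                  unsigned target_secs)
{
    uint64_t by_ratio = mem_bytes / 100 * ratio_percent;
    uint64_t by_time  = write_bytes_per_sec * target_secs;

    return by_time < by_ratio ? by_time : by_ratio;
}
```

With those inputs the ratio alone allows 1.6GB of dirty data, while the time-based value caps it at 100MB. On a small-memory box with a fast device the ratio would win instead.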

Sorry, no code yet.  But I think the idea is sound.

NeilBrown


* Re: Linux 2.6.29
  2009-03-26  2:46                               ` Kyle Moffett
@ 2009-03-26  2:51                                 ` Jeff Garzik
  2009-03-26  3:03                                   ` Kyle Moffett
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-26  2:51 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Matthew Garrett, Theodore Tso, Christoph Hellwig, Linus Torvalds,
	Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Kyle Moffett wrote:
>> Really, the problem is the filesystem interfaces are incomplete.  There are plenty of ways to specify a "FLUSH CACHE"-type command for an individual file or for the whole filesystem, but there aren't really any ways for programs to specify barriers (either whole-blockdev or per-LBA-range).  An fsync() implies you want to *wait* for the data... there's no way to ask it all to be queued with some ordering constraints.
>> Perhaps we ought to add a couple extra open flags, O_BARRIER_BEFORE and O_BARRIER_AFTER, and rename3(), etc functions that take flags arguments?
>> Or maybe a new set of syscalls like barrier(file1, file2) and fbarrier(fd1, fd2), which cause all pending changes (perhaps limit to this process?) to the file at fd1 to occur before any successive changes (again limited to this process?) to the file at fd2.
>> It seems that rename(oldfile, newfile) with an already-existing newfile should automatically imply barrier(oldfile, newfile) before it occurs, simply because so many programs rely on that.
>> In the cross-filesystem case, the fbarrier() might simply fsync(fd1), since that would provide the equivalent guarantee, albeit with possibly significant performance penalties.  I can't think of any easy way to prevent one filesystem from syncing writes to a particular file until another filesystem has finished an asynchronous fsync() call.  Perhaps a half-way solution would be to asynchronously fsync(fd1) and simply block the next write()/ioctl()/etc on fd2 until the async fsync returns.

Then you have just reinvented the transactional userspace API that 
people often want to replace POSIX API with.  Maybe one day they will 
succeed.

But "POSIX API replacement" is an area never short of proposals... :)

	Jeff





* Re: Linux 2.6.29
  2009-03-26  2:47                               ` Matthew Garrett
@ 2009-03-26  2:54                                 ` Kyle Moffett
  0 siblings, 0 replies; 664+ messages in thread
From: Kyle Moffett @ 2009-03-26  2:54 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Theodore Tso, Christoph Hellwig, Linus Torvalds, Jan Kara,
	Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 10:47 PM, Matthew Garrett <mjg59@srcf.ucam.org> wrote:
> On Wed, Mar 25, 2009 at 10:44:44PM -0400, Kyle Moffett wrote:
>
>>    Perhaps we ought to add a couple extra open flags, O_BARRIER_BEFORE and
>>    O_BARRIER_AFTER, and rename3(), etc functions that take flags arguments?
>>    Or maybe a new set of syscalls like barrier(file1, file2) and
>>    fbarrier(fd1, fd2), which cause all pending changes (perhaps limit to this
>>    process?) to the file at fd1 to occur before any successive changes (again
>>    limited to this process?) to the file at fd2.
>
> That's an option, but what would benefit? If rename is expected to
> preserve ordering (which I think it has to, in order to avoid breaking
> existing code) then are there any other interesting use cases?

The use cases would be programs like GIT (or any other kind of
database) where you want to ensure that your new pulled packfile has
fully hit disk before the ref update does.  If that ordering
constraint is applied, then we don't really care when we crash,
because either we have a partial packfile update (and we have to pull
again) or we have the whole thing.  The rename() barrier would ensure
that we either have the old ref or the new ref, but it would not check
to ensure that the whole packfile is on disk yet.

I would imagine that databases like MySQL could also use such support
to help speed up their database transaction support, instead of having
to run a bunch of threads which fsync() and buffer data internally.

Cheers,
Kyle Moffett
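
Illustrative sketch, not part of the original message: with only the
calls that exist today, the "packfile before ref" ordering described
above has to be forced by waiting.  The helper name and its arguments
are made up for this example; fsync() and rename() are the only real
interfaces used.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: force "pack before ref" with today's calls.
 * pack_fd is an open descriptor for the fully written packfile;
 * ref_tmp is a finished temporary copy of the new ref.
 */
int publish_pack_then_ref(int pack_fd, const char *ref_tmp,
			  const char *ref_path)
{
	/* Stand-in for the barrier: wait until the pack data is written out. */
	if (fsync(pack_fd) < 0)
		return -1;

	/* Only now switch the ref; rename() replaces the old name atomically. */
	return rename(ref_tmp, ref_path);
}
```

The fsync() here blocks the caller, which is exactly the cost a
barrier-style call would try to avoid.  The sketch also says nothing
about when the contents of ref_tmp or the directory entry reach disk.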


* Re: Linux 2.6.29
  2009-03-26  1:34                                 ` Linus Torvalds
@ 2009-03-26  2:59                                   ` Theodore Tso
  2009-03-26 16:24                                   ` Jan Kara
  1 sibling, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-26  2:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 06:34:32PM -0700, Linus Torvalds wrote:
> 
> Of course, the approach of just checking whether the buffer heads already 
> exists and are mapped (before bothering with anything else) probably works 
> fine in practice. In most loads, pages will have been dirtied by regular 
> "write()" system calls, and then we will have the buffers pre-allocated 
> regardless. 
> 

Yeah, I agree; solving the problem in the case of files being dirtied
via write() is going to solve a much larger percentage of the cases compared
to those cases where the pages are dirtied via mmap()'ed pages.

I thought we were doing this already, but clearly I should have looked
at the code first.  :-(

						- Ted


* Re: Linux 2.6.29
  2009-03-26  2:51                                 ` Jeff Garzik
@ 2009-03-26  3:03                                   ` Kyle Moffett
  2009-03-26  3:40                                     ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Kyle Moffett @ 2009-03-26  3:03 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Garrett, Theodore Tso, Christoph Hellwig, Linus Torvalds,
	Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 10:51 PM, Jeff Garzik <jeff@garzik.org> wrote:
> Then you have just reinvented the transactional userspace API that people
> often want to replace POSIX API with.  Maybe one day they will succeed.
>
> But "POSIX API replacement" is an area never short of proposals... :)

Well, I think the goal is not to *replace* the POSIX API or even
provide "transactional" guarantees.  The performance penalty for
atomic transactions is pretty high, and most programs (like GIT) don't
really give a damn, as they provide that on a higher level.

It's like the difference between a modern SMP system that supports
memory barriers and write snooping and one of the theoretical
"transactional memory" designs that have never caught on.

To be honest I think we could provide much better data consistency
guarantees and remove a lot of fsync() calls with just a basic
per-filesystem barrier() call.

Cheers,
Kyle Moffett


* Re: Linux 2.6.29
  2009-03-26  2:50               ` Neil Brown
@ 2009-03-26  3:13                 ` Theodore Tso
  0 siblings, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-26  3:13 UTC (permalink / raw)
  To: Neil Brown
  Cc: Linus Torvalds, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 01:50:10PM +1100, Neil Brown wrote:
> It shouldn't be too hard to add some concept of total time to this.
> If we track the number of write-outs per unit time and use that together
> with a "target time for fsync" to scale the 'dirty_bytes' number, we
> might be able to auto-tune the amount of dirty space to fit the speeds
> of the drives.
> 
> We would probably start with each device having a very low "max dirty"
> number which would cause writeouts to start soon.  Once the device
> demonstrates that it can do n-per-second (or whatever) the VM would
> allow the "max dirty" number to drift upwards.  I'm not sure how best
> to get it to move downwards if the device slows down (or the kernel
> over-estimated).  Maybe it should regularly decay so that the device
> keeps having to "prove" itself.

This seems like a really cool idea.

						-Ted
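
Illustrative sketch, not part of the original messages: one way the
auto-tuning described in the quoted text could look in plain C.  Every
name here, the "move half-way up" step and the 1/8 decay are
assumptions picked for the example; none of it is kernel code.

```c
#include <stdint.h>

/* Hypothetical per-device state. */
struct dirty_tuner {
	uint64_t max_dirty;	/* current per-device dirty limit, bytes */
	uint64_t start;		/* very low initial limit, bytes */
	uint64_t cap;		/* upper limit, e.g. derived from dirty_ratio */
	uint32_t target_ms;	/* "target time for fsync" */
};

void tuner_init(struct dirty_tuner *t, uint64_t start, uint64_t cap,
		uint32_t target_ms)
{
	t->max_dirty = start;
	t->start = start;
	t->cap = cap;
	t->target_ms = target_ms;
}

/* Call once per sampling period with the writeout rate the device showed. */
void tuner_sample(struct dirty_tuner *t, uint64_t bytes_per_sec)
{
	/* Dirty bytes the device could drain within the target time. */
	uint64_t proven = bytes_per_sec * t->target_ms / 1000;

	if (proven > t->max_dirty)
		t->max_dirty += (proven - t->max_dirty) / 2;	/* drift up */
	else
		t->max_dirty -= t->max_dirty / 8;	/* decay: prove it again */

	if (t->max_dirty < t->start)
		t->max_dirty = t->start;
	if (t->max_dirty > t->cap)
		t->max_dirty = t->cap;
}
```

A device that keeps demonstrating a high rate climbs toward the cap; one
that slows down, or was over-estimated, decays back toward the low
starting limit.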


* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-26  2:41                         ` Herbert Xu
@ 2009-03-26  3:20                           ` David Miller
  2009-03-26  3:40                             ` Herbert Xu
  2009-03-26  9:18                             ` Jarek Poplawski
  2009-03-26  7:39                           ` Jarek Poplawski
  1 sibling, 2 replies; 664+ messages in thread
From: David Miller @ 2009-03-26  3:20 UTC (permalink / raw)
  To: herbert
  Cc: jarkao2, mingo, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 26 Mar 2009 10:41:29 +0800

> On Wed, Mar 25, 2009 at 11:54:56PM +0100, Jarek Poplawski wrote:
> > 
> > Of course it's too late for verifying this now, but (for the future)
> > I think, this scenario could be considered:
> > 
> > process_backlog()			netif_rx()
> > 
> > if (!skb)
> > local_irq_enable()
> > 					if (queue.qlen) //NO
> > 					napi_schedule() //NOTHING
> > 					__skb_queue_tail() //qlen > 0
> > napi_complete()
> > ...					...
> > 					Every next netif_rx() sees
> > 					qlen > 0, so napi is never
> > 					scheduled again.
> > 
> > Then, something like this might work...
> 
> Yes this is why my original patch that started all this is broken.
> However, this doesn't apply to my patch that open-codes __napi_complete.

There is still a difference compared to your fix Herbert.  Jarek's
patch flushes GRO first before the unlink.

I still believe that's critical, although like you I can't pinpoint
why.

I know that GRO ought to be disabled here, but what if for some reason
it isn't? :-)

Adam Richter has successfully tested Jarek's variant, and if Ingo's
tests show that it makes his problem go away too then I'm definitely
going with Jarek's patch.
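
Illustrative sketch, not part of the original message: the interleaving
in the quoted diagram can be replayed with a tiny userspace model.  The
struct and function names are invented; "scheduled" stands in for
NAPI_STATE_SCHED and "qlen" for the length of input_pkt_queue.  This
models only the lost-wakeup scenario, not GRO or netpoll.

```c
#include <stdbool.h>
#include <stddef.h>

struct backlog_model {
	size_t qlen;		/* packets on the input queue */
	bool scheduled;		/* stands in for NAPI_STATE_SCHED */
};

/* netif_rx(): schedule the poller only if the queue was empty, then enqueue. */
static void model_netif_rx(struct backlog_model *b)
{
	if (b->qlen == 0)
		b->scheduled = true;	/* no effect if already set */
	b->qlen++;
}

/* Completion without a re-check: clears the bit whatever arrived meanwhile. */
static void model_complete(struct backlog_model *b)
{
	b->scheduled = false;
}

/* Completion that re-checks the queue, as if done with IRQs disabled. */
static void model_complete_rechecked(struct backlog_model *b)
{
	if (b->qlen == 0)
		b->scheduled = false;
}

/* Returns true if packets end up queued with nobody scheduled to poll. */
bool race_leaves_backlog_stuck(bool recheck)
{
	struct backlog_model b = { .qlen = 0, .scheduled = true };

	/* The poller saw an empty queue and re-enabled IRQs ... */
	model_netif_rx(&b);		/* ... an interrupt queues a packet ... */
	if (recheck)
		model_complete_rechecked(&b);
	else
		model_complete(&b);	/* ... and only then is the bit cleared */

	model_netif_rx(&b);		/* later packets see qlen > 0 */
	return b.qlen > 0 && !b.scheduled;
}
```

In this model race_leaves_backlog_stuck(false) reports the stuck state
from the diagram, while the re-checked variant keeps the poller
scheduled.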


* [PATCH v2] issue storage device flush via sync_blockdev()
  2009-03-25 21:29                                 ` [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29) Jeff Garzik
  2009-03-25 21:56                                   ` Eric Sandeen
  2009-03-25 22:01                                   ` Alan Cox
@ 2009-03-26  3:24                                   ` Jeff Garzik
  2009-03-27  2:50                                     ` Theodore Tso
  2009-03-27 20:50                                     ` [PATCH] issue storage dev flush from generic file_fsync helper Jeff Garzik
  2 siblings, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-26  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Flush storage dev writeback cache, for each call to sync_blockdev().

sync_blockdev() is used primarily for two purposes:

1) To flush all data to permanent storage prior to a major event,
such as: unmount, journal close, unplug, poweroff, explosion, ...

2) As a "sledgehammer hack" to provide fsync(2) via file_fsync to
filesystems, generally simple or legacy filesystems, such as HFS,
HFS+, ADFS, AFFS, bfs, UFS, NTFS, qnx4 and FAT.

This change guarantees that the underlying storage device will have
flushed any pending data in its writeback cache to permanent media,
before it returns (...if underlying storage dev supports flushes).

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
---
Changes since last patch:
- do not return error, if storage dev does not support flushes at all
  (-EOPNOTSUPP)

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..e04d7a4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -173,8 +173,17 @@ int sync_blockdev(struct block_device *bdev)
 {
 	int ret = 0;
 
-	if (bdev)
-		ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
+	if (!bdev)
+		return 0;
+	
+	ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
+	if (ret)
+		return ret;
+	
+	ret = blkdev_issue_flush(bdev, NULL);
+	if (ret == -EOPNOTSUPP)
+		ret = 0;
+	
 	return ret;
 }
 EXPORT_SYMBOL(sync_blockdev);


* Re: Linux 2.6.29
  2009-03-26  3:03                                   ` Kyle Moffett
@ 2009-03-26  3:40                                     ` Linus Torvalds
  2009-03-26  3:57                                       ` David Miller
  2009-03-26  4:58                                       ` Kyle Moffett
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26  3:40 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Jeff Garzik, Matthew Garrett, Theodore Tso, Christoph Hellwig,
	Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Wed, 25 Mar 2009, Kyle Moffett wrote:
> 
> Well, I think the goal is not to *replace* the POSIX API or even
> provide "transactional" guarantees.  The performance penalty for
> atomic transactions is pretty high, and most programs (like GIT) don't
> really give a damn, as they provide that on a higher level.

Speaking with my 'git' hat on, I can tell that

 - git was designed to have almost minimal requirements from the 
   filesystem, and to not do anything even half-way clever.

 - despite that, we've hit an absolute metric sh*tload of filesystem bugs 
   and misfeatures. Some very much in Linux. And some I bet git was the 
   first to ever notice, exactly because git tries to be really anal, in 
   ways that I can pretty much guarantee no normal program _ever_ is.

For example, the latest one came from git actually checking the error code 
from 'close()'. Tell me the last time you saw anybody do that in a real 
program. Hint: it's just not done. EVER. Git does it (and even then, git 
does it only for the core git object files that we care about so much), 
and we found a real data-loss CIFS bug thanks to that. Afaik, the bug has 
been there for a year and a half. Don't tell me nobody uses cifs.

Before that, we had cross-directory rename bugs. Or the inexplicable 
"pread() doesn't work correctly on HP-UX". Or the "readdir() returns the 
same entry multiple times" bug. And all of this without ever doing 
anything even _remotely_ odd. No file locking, no rewriting of old files, 
no lseek()ing in directories, no nothing.

Anybody who wants more complex and subtle filesystem interfaces is just 
crazy. Not only will they never get used, they'll definitely not be 
stable.

> To be honest I think we could provide much better data consistency
> guarantees and remove a lot of fsync() calls with just a basic
> per-filesystem barrier() call.

The problem is not that we have a lot of fsync() calls. Quite the reverse. 
fsync() is really really rare. So is being careful in general. The number 
of applications that do even the _minimal_ safety-net of "create new file, 
rename it atomically over an old one" is basically zero. Almost everybody 
ends up rewriting files with something like

	open(name, O_CREAT | O_TRUNC, 0666)
	write();
	close();

where there isn't an fsync in sight, nor any "create temp file", nor 
likely even any real error checking on the write(), much less the 
close().

And if we have a Linux-specific magic system call or sync action, it's 
going to be even more rarely used than fsync(). Do you think anybody 
really uses the OS X FSYNC_FULL ioctl? Nope. Outside of a few databases, 
it is almost certainly not going to be used, and fsync() will not be 
reliable in general.

So rather than come up with new barriers that nobody will use, filesystem 
people should aim to make "badly written" code "just work" unless people 
are really really unlucky. Because like it or not, that's what 99% of all 
code is.

The undeniable FACT that people don't tend to check errors from close() 
should, for example, mean that delayed allocation must still track disk 
full conditions. If your filesystem returns ENOSPC at close() 
rather than at write(), you just lost error coverage for disk full cases 
from 90% of all apps. It's that simple.

Crying that it's an application bug is like crying over the speed of 
light: you should deal with *reality*, not what you wish reality was. Same 
goes for any complaints that "people should write a temp-file, fsync it, 
and rename it over the original". You may wish that was what they did, but 
reality is that "open(filename, O_TRUNC | O_CREAT, 0666)" thing.

Harsh, I know. And in the end, even the _good_ applications will decide 
that it's not worth the performance penalty of doing an fsync(). In git, 
for example, where we generally try to be very very very careful, 
'fsync()' on the object files is turned off by default.

Why? Because turning it on results in unacceptable behavior on ext3. Now, 
admittedly, the git design means that a lost new DB file isn't deadly, 
just potentially very very annoying and confusing - you may have to roll 
back and re-do your operation by hand, and you have to know enough to be 
able to do it in the first place.

The point here? Sometimes those filesystem people who say "you must use 
fsync() to get well-defined semantics" are the same people who SCREWED IT 
UP SO DAMN BADLY THAT FSYNC ISN'T ACTUALLY REALISTICALLY USEABLE!

Theory and practice sometimes clash. And when that happens, theory loses. 
Every single time. 

		Linus
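
Illustrative sketch, not part of the original message: for contrast
with the open(O_TRUNC)/write()/close() pattern above, this is roughly
what the careful "temp file, fsync, rename" sequence looks like with
every return value checked, including close().  The function names and
the ".tmp" suffix are assumptions made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Replace path with buf by writing "<path>.tmp" and renaming it over path. */
int replace_file(const char *path, const char *buf, size_t len)
{
	char tmp[4096];
	int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
	int fd, saved;

	if (n < 0 || (size_t)n >= sizeof(tmp)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return -1;
	if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
		saved = errno;
		close(fd);
		unlink(tmp);
		errno = saved;
		return -1;
	}
	if (close(fd) < 0) {		/* the check almost nobody makes */
		saved = errno;
		unlink(tmp);
		errno = saved;
		return -1;
	}
	return rename(tmp, path);	/* old or new contents, never a mix */
}
```

The length of this compared with the three-line version above is a fair
illustration of why most programs never do it.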


* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-26  3:20                           ` David Miller
@ 2009-03-26  3:40                             ` Herbert Xu
  2009-03-26  9:18                             ` Jarek Poplawski
  1 sibling, 0 replies; 664+ messages in thread
From: Herbert Xu @ 2009-03-26  3:40 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, mingo, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

On Wed, Mar 25, 2009 at 08:20:50PM -0700, David Miller wrote:
>
> There is still a difference compared to your fix Herbert.  Jarek's
> patch flushes GRO first before the unlink.
> 
> I still believe that's critical, although like you I can't pinpoint
> why.
> 
> I know that GRO ought to be disabled here, but what if for some reason
> it isn't? :-)

Sure, I can accept that somehow someone has enabled GRO :) But
I'd still like to know why flushing the GRO afterwards would
lead to a hang.

> Adam Richter has successfully tested Jarek's variant, and if Ingo's
> tests show that it makes his problem go away too then I'm definitely
> going with Jarek's patch.

I don't have a problem with that.

I've asked Adam to test my patch as well.  So far the only failure
case against it has been on Ingo's machine, where the process_backlog
path isn't involved.  Yes it's used for loopback, but he's seeing
the hang on eth0, which with that config uses netif_receive_skb
so it should continue to work even if process_backlog completely
seizes up.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Linux 2.6.29
  2009-03-26  3:40                                     ` Linus Torvalds
@ 2009-03-26  3:57                                       ` David Miller
  2009-03-26  4:58                                       ` Kyle Moffett
  1 sibling, 0 replies; 664+ messages in thread
From: David Miller @ 2009-03-26  3:57 UTC (permalink / raw)
  To: torvalds
  Cc: kyle, jeff, mjg59, tytso, hch, jack, akpm, mingo, alan, arjan,
	a.p.zijlstra, npiggin, jens.axboe, drees76, jesper, linux-kernel

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 25 Mar 2009 20:40:23 -0700 (PDT)

> For example, the latest one came from git actually checking the error code 
> from 'close()'. Tell me the last time you saw anybody do that in a real 
> program. Hint: it's just not done. EVER.

Emacs does it too, and I know that you consider GNU emacs to be the
definition of abnormal :-)

That's how we found some misbehaviors in NFS a while ago, we used to
return -EAGAIN or something like that from close() on NFS files.  This
was like 12 years ago and it gave emacs massive heartburn.


* Re: Linux 2.6.29
  2009-03-26  3:40                                     ` Linus Torvalds
  2009-03-26  3:57                                       ` David Miller
@ 2009-03-26  4:58                                       ` Kyle Moffett
  2009-03-26  6:24                                         ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Kyle Moffett @ 2009-03-26  4:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Matthew Garrett, Theodore Tso, Christoph Hellwig,
	Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 11:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, 25 Mar 2009, Kyle Moffett wrote:
>> To be honest I think we could provide much better data consistency
>> guarantees and remove a lot of fsync() calls with just a basic
>> per-filesystem barrier() call.
>
> The problem is not that we have a lot of fsync() calls. Quite the reverse.
> fsync() is really really rare. So is being careful in general. The number
> of applications that do even the _minimal_ safety-net of "create new file,
> rename it atomically over an old one" is basically zero. Almost everybody
> ends up rewriting files with something like
>
>        open(name, O_CREAT | O_TRUNC, 0666)
>        write();
>        close();
>
> where there isn't an fsync in sight, nor any "create temp file", nor
> likely even any real error checking on the write(), much less the
> close().

Really, I think virtually all of the database programs would be
perfectly happy with an "fsbarrier(fd, flags)" syscall, where if "fd"
points to a regular file or directory then it instructs the underlying
filesystem to do whatever internal barrier it supports, and if not
just fail with -ENOTSUPP (so you can fall back to fdatasync(), etc).
Perhaps "flags" would allow a "data" or "metadata" barrier, but if not
it's not a big issue.

I've ended up having to write a fair amount of high-performance
filesystem library code which almost never ends up using fsync() quite
simply because the performance on it sucks so badly.  This is one of
the big reasons why so many critical database programs use O_DIRECT
and reinvent the wheel^H^H^H^H^H^H pagecache.  The only way you
can actually use it in high-bandwidth transaction applications is by
doing your own IO-thread and buffering system.

You have to have your own buffer ordering dependencies and call
fdatasync() or fsync() from individual threads in-between specific
ordered IOs.  The threading helps you keep other IO in flight while
waiting for the flush to finish.  For big databases on spinning media
(SSDs don't work precisely because they are small and your databases
are big) the overhead of a full flush may still be too large.  Even
with SSDs, with multiple processes vying for IO bandwidth you still
want some kind of application-level barrier to avoid introducing
bubbles in your IO pipeline.

It all comes down to a trivial calculation: if you can't get
(bandwidth * latency-to-stable-storage) bytes of data queued *behind*
a flush then your disk is going to sit idle waiting for more data
after completing it.  If a user-level tool needs to enforce ordering
between IOs the only tool right now is a full flush; when
database-oriented tools can use a barrier()-ish call instead, they can
issue the op and immediately resume keeping the IO queues full.

Cheers,
Kyle Moffett
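
Illustrative sketch, not part of the original message: the "trivial
calculation" above written out.  The function name and the figures in
the note below are made up for the example.

```c
#include <stdint.h>

/*
 * Bytes that must already be queued behind a flush so the device does
 * not go idle while the flush completes: bandwidth * latency.
 */
uint64_t bytes_to_cover_flush(uint64_t bandwidth_bytes_per_sec,
			      uint64_t flush_latency_us)
{
	return bandwidth_bytes_per_sec * flush_latency_us / 1000000u;
}
```

With an assumed 100 MB/s device and a 10 ms latency to stable storage,
this gives 1,000,000 bytes, i.e. about 1 MB that has to be in the queue
behind every flush.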


* Re: Linux 2.6.29
  2009-03-26  4:58                                       ` Kyle Moffett
@ 2009-03-26  6:24                                         ` Jeff Garzik
  2009-03-26 12:49                                           ` Kyle Moffett
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-26  6:24 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Linus Torvalds, Matthew Garrett, Theodore Tso, Christoph Hellwig,
	Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Kyle Moffett wrote:
> Really, I think virtually all of the database programs would be
> perfectly happy with an "fsbarrier(fd, flags)" syscall, where if "fd"
> points to a regular file or directory then it instructs the underlying
> filesystem to do whatever internal barrier it supports, and if not
> just fail with -ENOTSUPP (so you can fall back to fdatasync(), etc).
> Perhaps "flags" would allow a "data" or "metadata" barrier, but if not
> it's not a big issue.

If you want a per-fd barrier call, there is always sync_file_range(2)


> If a user-level tool needs to enforce ordering
> between IOs the only tool right now is a full flush

or sync_file_range(2)...

	Jeff
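
Illustrative sketch, not part of the original message: a minimal use of
sync_file_range(2), which is Linux-specific.  The two wrapper names are
invented for the example.  Note the man page's own warning that none of
these operations writes out file metadata, so this is not a durability
guarantee, and by itself it is not an ordering primitive either.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Start writeback of [offset, offset + nbytes) without waiting for it. */
int start_range_writeback(int fd, off_t offset, off_t nbytes)
{
	return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Write out the same range and wait until that writeback has finished. */
int flush_range_and_wait(int fd, off_t offset, off_t nbytes)
{
	return sync_file_range(fd, offset, nbytes,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER);
}
```

A caller could start writeback on one range, keep issuing other IO, and
only wait on that range right before the write that depends on it.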





* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-26  0:10                         ` David Miller
@ 2009-03-26  6:43                           ` Jarek Poplawski
  2009-03-26  7:52                             ` David Miller
  0 siblings, 1 reply; 664+ messages in thread
From: Jarek Poplawski @ 2009-03-26  6:43 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, mingo, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

On Wed, Mar 25, 2009 at 05:10:19PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Wed, 25 Mar 2009 23:54:56 +0100
> 
> Ingo, in case it isn't completely obvious, it would be
> wonderful if you could try Jarek's patch below with your
> test case.
> 
> Thanks!

David, of course testing my patch would be very nice, but I think we
should definitely & immediately have Herbert's already tested
revert patch in -stable now. 

Thanks,
Jarek P.

> 
> > Herbert Xu wrote, On 03/25/2009 01:26 PM:
> > 
> > > On Wed, Mar 25, 2009 at 01:20:46PM +0100, Ingo Molnar wrote:
> > >> ok - i have started testing the delta below, on top of the plain 
> > >> revert.
> > > 
> > > Thanks! BTW Ingo, any chance you could help us identify the problem
> > > with the previous patch? I don't have a forcedeth machine here
> > > and the hang you had with my patch that open-coded __napi_complete
> > > appears intimately connected to forcedeth (with NAPI enabled).
> > 
> > Of course it's too late for verifying this now, but (for the future)
> > I think, this scenario could be considered:
> > 
> > process_backlog()			netif_rx()
> > 
> > if (!skb)
> > local_irq_enable()
> > 					if (queue.qlen) //NO
> > 					napi_schedule() //NOTHING
> > 					__skb_queue_tail() //qlen > 0
> > napi_complete()
> > ...					...
> > 					Every next netif_rx() sees
> > 					qlen > 0, so napi is never
> > 					scheduled again.
> > 
> > Then, something like this might work...
> > 
> > Jarek P.
> > --- (2.6.29)
> >  net/core/dev.c |    6 +++++-
> >  1 files changed, 5 insertions(+), 1 deletions(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index e3fe5c7..cf53c24 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -2589,7 +2589,11 @@ static int process_backlog(struct napi_struct *napi, int quota)
> >  		skb = __skb_dequeue(&queue->input_pkt_queue);
> >  		if (!skb) {
> >  			local_irq_enable();
> > -			napi_complete(napi);
> > +			napi_gro_flush(napi);
> > +			local_irq_disable();
> > +			if (skb_queue_empty(&queue->input_pkt_queue))
> > +				__napi_complete(napi);
> > +			local_irq_enable();
> >  			goto out;
> >  		}
> >  		local_irq_enable();


* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-26  2:41                         ` Herbert Xu
  2009-03-26  3:20                           ` David Miller
@ 2009-03-26  7:39                           ` Jarek Poplawski
  1 sibling, 0 replies; 664+ messages in thread
From: Jarek Poplawski @ 2009-03-26  7:39 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ingo Molnar, David Miller, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

On Thu, Mar 26, 2009 at 10:41:29AM +0800, Herbert Xu wrote:
...
> Yes this is why my original patch that started all this is broken.
> However, this doesn't apply to my patch that open-codes __napi_complete.
...
> This should work too.  However, the fact that the following patch
> is broken means that we have bigger problems.

I agree. This should be effectively equal to the revert, so I guess
Ingo probably hit some other bug btw.

Jarek P.

> 
> net: Fix netpoll lockup in legacy receive path
> 
> When I fixed the GRO crash in the legacy receive path I used
> napi_complete to replace __napi_complete.  Unfortunately they're
> not the same when NETPOLL is enabled, which may result in us
> not calling __napi_complete at all.
> 
> What's more, we really do need to keep the __napi_complete call
> within the IRQ-off section since in theory an IRQ can occur in
> between and fill up the backlog to the maximum, causing us to
> lock up.
> 
> This patch fixes this by essentially open-coding __napi_complete.
> 
> Note we no longer need the memory barrier because this function
> is per-cpu.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e3fe5c7..2a7f6b3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  		local_irq_disable();
>  		skb = __skb_dequeue(&queue->input_pkt_queue);
>  		if (!skb) {
> +			list_del(&napi->poll_list);
> +			clear_bit(NAPI_STATE_SCHED, &napi->state);
>  			local_irq_enable();
> -			napi_complete(napi);
> -			goto out;
> +			break;
>  		}
>  		local_irq_enable();
>  
> @@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  
>  	napi_gro_flush(napi);
>  
> -out:
>  	return work;
>  }
> 
> Thanks,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-26  6:43                           ` Jarek Poplawski
@ 2009-03-26  7:52                             ` David Miller
  2009-03-26  7:59                               ` Jarek Poplawski
  0 siblings, 1 reply; 664+ messages in thread
From: David Miller @ 2009-03-26  7:52 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, mingo, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu, 26 Mar 2009 06:43:17 +0000

> On Wed, Mar 25, 2009 at 05:10:19PM -0700, David Miller wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Wed, 25 Mar 2009 23:54:56 +0100
> > 
> > Ingo, in case it isn't completely obvious, it would be
> > wonderful if you could try Jarek's patch below with your
> > test case.
> > 
> > Thanks!
> 
> David, of course testing my patch would be very nice, but I think we
> should definitely & immediately have Herbert's already tested
> revert patch in -stable now. 

Sure, I'll do that.

We can toy with this thing during 2.6.30 development.


* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 12:08                 ` Herbert Xu
  2009-03-25 12:20                   ` Ingo Molnar
@ 2009-03-26  7:59                   ` David Miller
  1 sibling, 0 replies; 664+ messages in thread
From: David Miller @ 2009-03-26  7:59 UTC (permalink / raw)
  To: herbert
  Cc: mingo, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 25 Mar 2009 20:08:40 +0800

> GRO: Disable GRO on legacy netif_rx path
> 
> When I fixed the GRO crash in the legacy receive path I used
> napi_complete to replace __napi_complete.  Unfortunately they're
> not the same when NETPOLL is enabled, which may result in us
> not calling __napi_complete at all.
> 
> What's more, we really do need to keep the __napi_complete call
> within the IRQ-off section since in theory an IRQ can occur in
> between and fill up the backlog to the maximum, causing us to
> lock up.
> 
> Since we can't seem to find a fix that works properly right now,
> this patch reverts all the GRO support from the netif_rx path.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, and will push to -stable.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-26  7:52                             ` David Miller
@ 2009-03-26  7:59                               ` Jarek Poplawski
  0 siblings, 0 replies; 664+ messages in thread
From: Jarek Poplawski @ 2009-03-26  7:59 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, mingo, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel

On Thu, Mar 26, 2009 at 12:52:12AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Thu, 26 Mar 2009 06:43:17 +0000
> 
> > On Wed, Mar 25, 2009 at 05:10:19PM -0700, David Miller wrote:
> > > From: Jarek Poplawski <jarkao2@gmail.com>
> > > Date: Wed, 25 Mar 2009 23:54:56 +0100
> > > 
> > > Ingo, in case it isn't completely obvious, it would be
> > > wonderful if you could try Jarek's patch below with your
> > > test case.
> > > 
> > > Thanks!
> > 
> > David, of course testing my patch would be very nice, but I think we
> > > should definitely & immediately have Herbert's already-tested
> > > revert patch in -stable now.
> 
> Sure, I'll do that.
> 
> We can toy with this thing during 2.6.30 development.

Or add this to the next -stable after testing.

Jarek P.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-26  1:26                                       ` Jeff Garzik
  2009-03-26  1:33                                         ` Jeff Garzik
@ 2009-03-26  8:24                                         ` Christoph Hellwig
  1 sibling, 0 replies; 664+ messages in thread
From: Christoph Hellwig @ 2009-03-26  8:24 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ric Wheeler, Eric Sandeen, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, Mar 25, 2009 at 09:26:11PM -0400, Jeff Garzik wrote:
> What do you think sync_blockdev() does?  What is its purpose?

It writes out data in the block device inode.  Which does not include
any user data, and might not contain anything at all for filesystems
that have their own address space for metadata.  It's definitely the
wrong place for this kind of hack.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:41                                 ` Hugh Dickins
@ 2009-03-26  8:57                                   ` Jens Axboe
  2009-03-26 14:47                                     ` Hugh Dickins
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-26  8:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, Mar 25 2009, Hugh Dickins wrote:
> On Wed, 25 Mar 2009, Jens Axboe wrote:
> > On Wed, Mar 25 2009, Ric Wheeler wrote:
> > > Jens Axboe wrote:
> > >>
> > >> Another problem is that FLUSH_CACHE sucks. Really. And not just on
> > >> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
> > >> wait for the world to finish. Pretty hard to teach people to use a nicer
> > >> fdatasync(), when the majority of the cost now becomes flushing the
> > >> cache of that 1TB drive you happen to have 8 partitions on. Good luck
> > >> with that.
> > >>   
> > > And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE  
> > > is per device (not file system).
> > >
> > > When you issue an fsync() on a disk with multiple partitions, you will  
> > > flush the data for all of its partitions from the write cache....
> > 
> > Exactly, that's what my (vague) 8 partition reference was for :-)
> > A range flush would be so much more palatable.
> 
> Tangential question, but am I right in thinking that BIO_RW_BARRIER
> similarly bars across all partitions, whereas its WRITE_BARRIER and
> DISCARD_BARRIER users would actually prefer it to apply to just one?

All the barriers refer to just that range which the barrier itself
references. The problem with the full device flushes is implementation
on the hardware side, since we can't do small range flushes. So it's not
as-designed, but rather the best we can do...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 21:22                                   ` James Bottomley
@ 2009-03-26  8:59                                     ` Jens Axboe
  2009-03-30 19:05                                     ` range-based cache flushing (was Re: Linux 2.6.29) Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-26  8:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, Mar 25 2009, James Bottomley wrote:
> On Wed, 2009-03-25 at 16:25 -0400, Ric Wheeler wrote:
> > Jeff Garzik wrote:
> > > Ric Wheeler wrote:
> > >> And, as I am sure that you do know, to add insult to injury,
> > >> FLUSH_CACHE is per device (not file system).
> > >>
> > >> When you issue an fsync() on a disk with multiple partitions, you 
> > >> will flush the data for all of its partitions from the write cache....
> > >
> > > SCSI'S SYNCHRONIZE CACHE command already accepts an (LBA, length) 
> > > pair.  We could make use of that.
> > >
> > > And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
> > > demonstrate clear benefit.
> > >
> > >     Jeff
> > 
> > How well supported is this in SCSI?  Can we try it out with a commodity 
> > SAS drive?
> 
> What do you mean by well supported?  The way the SCSI standard is
> written, a device can do a complete cache flush when a range flush is
> requested and still be fully standards compliant.  There's no easy way
> to tell if it does a complete cache flush every time other than by
> taking the firmware apart (or asking the manufacturer).

That's the fear of range flushes, if it was added to t13 as well. Unless
that Other OS uses range flushes, most firmware writers would most
likely implement any range as 0...-1 and it wouldn't help us at all. In
fact it would make things worse, as we would have done extra work to
actually find these ranges, unless you went cheap and said 'just flush
this partition'.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-25 23:50                             ` Jan Kara
  2009-03-26  0:04                               ` Linus Torvalds
@ 2009-03-26  9:06                               ` Ingo Molnar
  2009-03-26  9:09                                 ` Ingo Molnar
                                                   ` (6 more replies)
  1 sibling, 7 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-26  9:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Theodore Tso, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath


* Jan Kara <jack@suse.cz> wrote:

> > So tell me again how the VM can rely on the filesystem not 
> > blocking at random points.
>
>   I can write a patch to make writepage() in the non-"mmapped 
> creation" case non-blocking on journal. But I'll also have to find 
> out whether it really helps something. But it's probably worth 
> trying...

_all_ the problems i ever had with ext3 were 'collateral damage' 
type of things: simple writes (sometimes even reads) getting 
serialized on some large [but reasonable] dirtying activity 
elsewhere - even if the system was still well within its 
hard-dirty-limit threshold.

So it sure sounds like an area worth improving, and it's not that 
hard to reproduce either. Take a system with enough RAM but only a 
single disk, and do this in a kernel tree:

  sync
  echo 3 > /proc/sys/vm/drop_caches

  while :; do
    date
    make mrproper      2>/dev/null >/dev/null
    make defconfig     2>/dev/null >/dev/null
    make -j32 bzImage  2>/dev/null >/dev/null
  done &

Plain old kernel build, no distcc and no icecream. Wait a few 
minutes for the system to reach equilibrium. There's no tweaking 
anywhere, kernel, distro and filesystem defaults used everywhere:

 aldebaran:/home/mingo/linux/linux> ./compile-test 
 Thu Mar 26 10:33:03 CET 2009
 Thu Mar 26 10:35:24 CET 2009
 Thu Mar 26 10:36:48 CET 2009
 Thu Mar 26 10:38:54 CET 2009
 Thu Mar 26 10:41:22 CET 2009
 Thu Mar 26 10:43:41 CET 2009
 Thu Mar 26 10:46:02 CET 2009
 Thu Mar 26 10:48:28 CET 2009

And try to use the system while this workload is going on. Use Vim 
to edit files in this kernel tree. Use plain _cat_ - and i hit 
delays all the time - and it's not the CPU scheduler but all IO 
related.

I have such an ext3 based system where i can do such tests and where 
i don't mind crashes and data corruption either, so if you send me 
experimental patches against latest -git i can try them immediately. 
The system has 16 CPUs, 12GB of RAM and a single disk.

Btw., i had this test going on that box while i wrote some simple 
scripts in Vim - and it was a horrible experience. The worst wait 
was well above one minute - Vim just hung there indefinitely. Not 
even Ctrl-Z was possible. I captured one such wait, it was hanging 
right here:

 aldebaran:~/linux/linux> cat /proc/3742/stack
 [<ffffffff8034790a>] log_wait_commit+0xbd/0x110
 [<ffffffff803430b2>] journal_stop+0x1df/0x20d
 [<ffffffff8034421f>] journal_force_commit+0x28/0x2d
 [<ffffffff80331c69>] ext3_force_commit+0x2b/0x2d
 [<ffffffff80328b56>] ext3_write_inode+0x3e/0x44
 [<ffffffff802ebb9d>] __sync_single_inode+0xc1/0x2ad
 [<ffffffff802ebed6>] __writeback_single_inode+0x14d/0x15a
 [<ffffffff802ebf0c>] sync_inode+0x29/0x34
 [<ffffffff80327453>] ext3_sync_file+0xa7/0xb4
 [<ffffffff802ef17d>] vfs_fsync+0x78/0xaf
 [<ffffffff802ef1eb>] do_fsync+0x37/0x4d
 [<ffffffff802ef228>] sys_fsync+0x10/0x14
 [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

It took about 120 seconds for it to recover.
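
A rough way to put a number on this kind of stall is to time a tiny
fsync'ed write while the build loop is running. This is only a sketch:
fsync_latency_ms is a made-up helper name, and it assumes GNU dd (for
conv=fsync) and GNU date (for %N).

```shell
#!/bin/bash
# fsync_latency_ms FILE: write 50 bytes to FILE, fsync it, and print the
# elapsed wall-clock time in milliseconds.
# Hypothetical helper; assumes GNU dd (conv=fsync) and GNU date (%N).
fsync_latency_ms() {
    local file=$1 start end
    start=$(date +%s%N)
    # conv=fsync makes dd call fsync() on the output file before exiting.
    dd if=/dev/zero of="$file" bs=50 count=1 conv=fsync 2>/dev/null || return 1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example (commented out): sample once per second in the test directory.
#   while sleep 1; do fsync_latency_ms ./fsync-probe; done
```

The probe file has to live on the filesystem under test, otherwise the
fsync does not go through the journal being measured.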

And it's not just sys_fsync(). The script i wrote tests file read 
latencies. I have created 1000 files with the same size (all copies 
of kernel/sched.c ;-), and tested their cache-cold plain-cat 
performance via:

  for ((i=0;i<1000;i++)); do
    printf "file #%4d, plain reading it took: " $i
    /usr/bin/time -f "%e seconds."  cat $i >/dev/null
  done

I.e. plain, supposedly high-prio reads. The result is very common 
hiccups in read latencies:

file # 579 (253560 bytes), reading it took: 0.08 seconds.
file # 580 (253560 bytes), reading it took: 0.05 seconds.
file # 581 (253560 bytes), reading it took: 0.01 seconds.
file # 582 (253560 bytes), reading it took: 0.01 seconds.
file # 583 (253560 bytes), reading it took: 4.61 seconds.
file # 584 (253560 bytes), reading it took: 1.29 seconds.
file # 585 (253560 bytes), reading it took: 3.01 seconds.
file # 586 (253560 bytes), reading it took: 7.74 seconds.
file # 587 (253560 bytes), reading it took: 3.22 seconds.
file # 588 (253560 bytes), reading it took: 0.05 seconds.
file # 589 (253560 bytes), reading it took: 0.36 seconds.
file # 590 (253560 bytes), reading it took: 7.39 seconds.
file # 591 (253560 bytes), reading it took: 7.58 seconds.
file # 592 (253560 bytes), reading it took: 7.90 seconds.
file # 593 (253560 bytes), reading it took: 8.78 seconds.
file # 594 (253560 bytes), reading it took: 8.01 seconds.
file # 595 (253560 bytes), reading it took: 7.47 seconds.
file # 596 (253560 bytes), reading it took: 11.52 seconds.
file # 597 (253560 bytes), reading it took: 10.33 seconds.
file # 598 (253560 bytes), reading it took: 8.56 seconds.
file # 599 (253560 bytes), reading it took: 7.58 seconds.
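
To boil a log like the one above down to a couple of numbers, something
along these lines would do. It is a sketch only: summarize_read_latency
is a made-up name and the 1-second threshold is an arbitrary default.

```shell
#!/bin/bash
# summarize_read_latency [LIMIT]: read "... reading it took: N seconds."
# lines on stdin and report how many reads took LIMIT seconds or more
# (default 1) and the worst case. Hypothetical helper name.
summarize_read_latency() {
    awk -v limit="${1:-1}" '
        # Only accept lines whose second-to-last field is a plain number;
        # lines polluted by terminal echo (e.g. "^C0.69") are skipped.
        $NF == "seconds." && $(NF-1) ~ /^[0-9]+(\.[0-9]+)?$/ {
            n++
            t = $(NF-1) + 0
            if (t > max) max = t
            if (t >= limit) slow++
        }
        END {
            printf "%d reads, %d at or above %.2fs, worst %.2fs\n",
                   n, slow, limit, max
        }'
}

# Example (commented out):
#   ./read-test 2>&1 | summarize_read_latency 1
```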

The system's RAM is ridiculously under-utilized, 96.1% is free, only 
3.9% is utilized:

              total       used       free     shared    buffers     cached
 Mem:      12318192     476732   11841460          0      48324     142936
 -/+ buffers/cache:     285472   12032720
 Swap:      4096564          0    4096564

Dirty data in /proc/meminfo fluctuates between 0.4% and 1.6% of 
total RAM. (the script removes the freshly built kernel object 
files, so the workload is pretty steady.)

The peak of 1.6% looks like this:

Dirty:            118376 kB
Dirty:            143784 kB
Dirty:            161756 kB
Dirty:            185084 kB
Dirty:            210524 kB
Dirty:            213348 kB
Dirty:            200124 kB
Dirty:            122152 kB
Dirty:            121508 kB
Dirty:            121512 kB

(1 second snapshots)
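
Snapshots like these can be taken with a few lines of shell. The
following is a minimal sketch, assuming the usual /proc/meminfo layout;
dirty_kb and dirty_pct are made-up helper names, and this is not
necessarily how the numbers above were collected.

```shell
#!/bin/bash
# dirty_kb [FILE]: print the "Dirty:" value (in kB) from a meminfo-style
# file. Hypothetical helper; FILE defaults to /proc/meminfo.
dirty_kb() {
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}

# dirty_pct DIRTY_KB TOTAL_KB: dirty memory as a percentage of total RAM,
# rounded to one decimal place.
dirty_pct() {
    awk -v d="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * d / t }'
}

# Example sampling loop (one snapshot per second), left commented out:
#   while sleep 1; do printf 'Dirty: %17s kB\n' "$(dirty_kb)"; done
```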

So the problems are all over the place and they are absolutely, 
trivially reproducible. And this is how a default ext3 based distro 
and the default upstream kernel will present itself to new Linux 
users and developers. It's not a pretty experience.

Oh, and while at it - also a job control complaint. I tried to 
Ctrl-C the above script:

file # 858 (253560 bytes), reading it took: 0.06 seconds.
file # 859 (253560 bytes), reading it took: 0.02 seconds.
file # 860 (253560 bytes), reading it took: 5.53 seconds.
file # 861 (253560 bytes), reading it took: 3.70 seconds.
file # 862 (253560 bytes), reading it took: 0.88 seconds.
file # 863 (253560 bytes), reading it took: 0.04 seconds.
file # 864 (253560 bytes), reading it took: ^C0.69 seconds.
file # 865 (253560 bytes), reading it took: ^C0.49 seconds.
file # 866 (253560 bytes), reading it took: ^C0.01 seconds.
file # 867 (253560 bytes), reading it took: ^C0.02 seconds.
file # 868 (253560 bytes), reading it took: ^C^C0.01 seconds.
file # 869 (253560 bytes), reading it took: ^C^C0.04 seconds.
file # 870 (253560 bytes), reading it took: ^C^C^C0.03 seconds.
file # 871 (253560 bytes), reading it took: ^C0.02 seconds.
file # 872 (253560 bytes), reading it took: ^C^C0.02 seconds.
file # 873 (253560 bytes), reading it took: 
^C^C^C^Caldebaran:~/linux/linux/test-files/src> 

I had to hit Ctrl-C numerous times before Bash would honor it. This 
too is a very common thing on large SMP systems. I'm willing to test 
patches until all these problems are fixed. Any takers?

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-25 22:01                       ` Ingo Molnar
  2009-03-25 22:20                         ` Ken Witherow
@ 2009-03-26  9:07                         ` Herbert Xu
  2009-03-26  9:25                           ` Ingo Molnar
  1 sibling, 1 reply; 664+ messages in thread
From: Herbert Xu @ 2009-03-26  9:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel

On Wed, Mar 25, 2009 at 11:01:49PM +0100, Ingo Molnar wrote:
> 
> Sure, can try that. Probably the best would be if you sent me a 
> combo patch with the precise patch you meant me to try (there were 
> several patches, i'm not sure which one is the 'previous' one) plus 
> the forcedeth debug enable change as well.

Sure, here's the patch to do both.

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..2a7f6b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			list_del(&napi->poll_list);
+			clear_bit(NAPI_STATE_SCHED, &napi->state);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
@@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 
 	napi_gro_flush(napi);
 
-out:
 	return work;
 }

diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index b8251e8..101e552 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -64,7 +64,7 @@
 #include <asm/uaccess.h>
 #include <asm/system.h>
 
-#if 0
+#if 1
 #define dprintk			printk
 #else
 #define dprintk(x...)		do { } while (0)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
@ 2009-03-26  9:09                                 ` Ingo Molnar
  2009-03-26 11:08                                 ` Jens Axboe
                                                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-26  9:09 UTC (permalink / raw)
  To: Jan Kara, Oleg Nesterov, Roland McGrath
  Cc: Linus Torvalds, Theodore Tso, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List


* Ingo Molnar <mingo@elte.hu> wrote:

> Oh, and while at it - also a job control complaint. I tried to 
> Ctrl-C the above script:
> 
> file # 858 (253560 bytes), reading it took: 0.06 seconds.
> file # 859 (253560 bytes), reading it took: 0.02 seconds.
> file # 860 (253560 bytes), reading it took: 5.53 seconds.
> file # 861 (253560 bytes), reading it took: 3.70 seconds.
> file # 862 (253560 bytes), reading it took: 0.88 seconds.
> file # 863 (253560 bytes), reading it took: 0.04 seconds.
> file # 864 (253560 bytes), reading it took: ^C0.69 seconds.
> file # 865 (253560 bytes), reading it took: ^C0.49 seconds.
> file # 866 (253560 bytes), reading it took: ^C0.01 seconds.
> file # 867 (253560 bytes), reading it took: ^C0.02 seconds.
> file # 868 (253560 bytes), reading it took: ^C^C0.01 seconds.
> file # 869 (253560 bytes), reading it took: ^C^C0.04 seconds.
> file # 870 (253560 bytes), reading it took: ^C^C^C0.03 seconds.
> file # 871 (253560 bytes), reading it took: ^C0.02 seconds.
> file # 872 (253560 bytes), reading it took: ^C^C0.02 seconds.
> file # 873 (253560 bytes), reading it took: 
> ^C^C^C^Caldebaran:~/linux/linux/test-files/src> 
> 
> I had to hit Ctrl-C numerous times before Bash would honor it. 
> This too is a very common thing on large SMP systems. [...]

It happened when i tried to Ctrl-C the compile job as well:

Thu Mar 26 11:04:05 CET 2009
Thu Mar 26 11:06:30 CET 2009
^CThu Mar 26 11:07:55 CET 2009
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^Caldebaran:/home/mingo/linux/linux> 

a single Ctrl-C is rarely enough to stop busy Bash shell scripts on 
Linux. Why?

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-26  3:20                           ` David Miller
  2009-03-26  3:40                             ` Herbert Xu
@ 2009-03-26  9:18                             ` Jarek Poplawski
  1 sibling, 0 replies; 664+ messages in thread
From: Jarek Poplawski @ 2009-03-26  9:18 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, mingo, r.schwebel, torvalds, blaschka, tglx,
	a.p.zijlstra, linux-kernel, kernel, Adam Richter, Sascha Hauer

On Wed, Mar 25, 2009 at 08:20:50PM -0700, David Miller wrote:
...
> Adam Richter has successfully tested Jarek's variant, and if Ingo's
> tests show that it makes his problem go away too then I'm definitely
> going with Jarek's patch.

I hope this patch version on top of Herbert's revert could be useful
then.

Thanks,
Jarek P.
-------------------->
GRO: Re-enable GRO on legacy netif_rx path.

Replacing __napi_complete() with napi_complete() in process_backlog()
caused various breaks for non-napi NICs. It looks like the following
scenario can happen:

process_backlog()			netif_rx()

if (!skb)
local_irq_enable()
					if (queue.qlen) //NO
					napi_schedule() //NOTHING
					__skb_queue_tail() //qlen > 0
napi_complete()
...					...
					Every next netif_rx() sees
					qlen > 0, so napi is never
					scheduled again.

This patch fixes it by open-coding napi_complete() with an additional
check for empty queue before __napi_complete().

With help of debugging by Ingo Molnar, Sascha Hauer and Herbert Xu.

Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Adam Richter <adam_richter2004@yahoo.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
(vs. 2.6.29 with Herbert's "GRO: Disable GRO on legacy netif_rx path")

diff -Nurp a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c	2009-03-26 08:32:51.000000000 +0000
+++ b/net/core/dev.c	2009-03-26 08:37:58.000000000 +0000
@@ -2588,15 +2588,22 @@ static int process_backlog(struct napi_s
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
-			__napi_complete(napi);
 			local_irq_enable();
-			break;
+			napi_gro_flush(napi);
+			local_irq_disable();
+			if (skb_queue_empty(&queue->input_pkt_queue))
+				__napi_complete(napi);
+			local_irq_enable();
+			goto out;
 		}
 		local_irq_enable();
 
-		netif_receive_skb(skb);
+		napi_gro_receive(napi, skb);
 	} while (++work < quota && jiffies == start_time);
 
+	napi_gro_flush(napi);
+
+out:
 	return work;
 }
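
The scenario in the changelog above can be replayed with a toy model.
This is a sketch for illustration only, not kernel code: two shell
variables stand in for the backlog state, and every function name below
is made up.

```shell
#!/bin/bash
# Toy model of the race described above; not kernel code. Two variables
# stand in for the backlog state: qlen (queued packets) and sched
# (1 while the backlog NAPI is scheduled). All function names are made up.

netif_rx_model() {
    # An empty queue means the poller must be (re)scheduled; scheduling an
    # already scheduled poller is a no-op, as with napi_schedule().
    if [ "$qlen" -eq 0 ]; then sched=1; fi
    qlen=$((qlen + 1))
}

complete_unchecked() { sched=0; }          # models plain napi_complete()

complete_checked() {                       # models the patched path
    if [ "$qlen" -eq 0 ]; then sched=0; fi
}

# run_scenario COMPLETE_FN: replay the interleaving from the changelog.
run_scenario() {
    qlen=0; sched=1        # poller just found the queue empty, IRQs back on
    netif_rx_model         # interrupt: qlen 0 -> 1, schedule is a no-op
    "$1"                   # poller now completes
    netif_rx_model         # later packets see qlen > 0 and never schedule
    if [ "$qlen" -gt 0 ] && [ "$sched" -eq 0 ]; then
        echo stalled
    else
        echo ok
    fi
}
```

Here "run_scenario complete_unchecked" prints "stalled" and
"run_scenario complete_checked" prints "ok". The model treats the
emptiness check and the completion as one step, which corresponds to
doing both inside the IRQ-off section in the patch.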
 

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Revert "gro: Fix legacy path napi_complete crash",
  2009-03-26  9:07                         ` Herbert Xu
@ 2009-03-26  9:25                           ` Ingo Molnar
  0 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-26  9:25 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, r.schwebel, torvalds, blaschka, tglx, a.p.zijlstra,
	linux-kernel, kernel


* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Wed, Mar 25, 2009 at 11:01:49PM +0100, Ingo Molnar wrote:
> > 
> > Sure, can try that. Probably the best would be if you sent me a 
> > combo patch with the precise patch you meant me to try (there were 
> > several patches, i'm not sure which one is the 'previous' one) plus 
> > the forcedeth debug enable change as well.
> 
> Sure, here's the patch to do both.

> diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
> index b8251e8..101e552 100644
> --- a/drivers/net/forcedeth.c
> +++ b/drivers/net/forcedeth.c
> @@ -64,7 +64,7 @@
>  #include <asm/uaccess.h>
>  #include <asm/system.h>
>  
> -#if 0
> +#if 1
>  #define dprintk			printk
>  #else
>  #define dprintk(x...)		do { } while (0)

The box is not even able to boot up, so many messages are printed.

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
  2009-03-26  9:09                                 ` Ingo Molnar
@ 2009-03-26 11:08                                 ` Jens Axboe
  2009-03-26 14:27                                   ` Arjan van de Ven
  2009-03-26 11:37                                 ` Theodore Tso
                                                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-26 11:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Theodore Tso, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Thu, Mar 26 2009, Ingo Molnar wrote:
> And it's not just sys_fsync(). The script i wrote tests file read 
> latencies. I have created 1000 files with the same size (all copies 
> of kernel/sched.c ;-), and tested their cache-cold plain-cat 
> performance via:
> 
>   for ((i=0;i<1000;i++)); do
>     printf "file #%4d, plain reading it took: " $i
>     /usr/bin/time -f "%e seconds."  cat $i >/dev/null
>   done
> 
> I.e. plain, supposedly high-prio reads. The result is very common 
> hiccups in read latencies:
> 
> file # 579 (253560 bytes), reading it took: 0.08 seconds.
> file # 580 (253560 bytes), reading it took: 0.05 seconds.
> file # 581 (253560 bytes), reading it took: 0.01 seconds.
> file # 582 (253560 bytes), reading it took: 0.01 seconds.
> file # 583 (253560 bytes), reading it took: 4.61 seconds.
> file # 584 (253560 bytes), reading it took: 1.29 seconds.
> file # 585 (253560 bytes), reading it took: 3.01 seconds.
> file # 586 (253560 bytes), reading it took: 7.74 seconds.
> file # 587 (253560 bytes), reading it took: 3.22 seconds.
> file # 588 (253560 bytes), reading it took: 0.05 seconds.
> file # 589 (253560 bytes), reading it took: 0.36 seconds.
> file # 590 (253560 bytes), reading it took: 7.39 seconds.
> file # 591 (253560 bytes), reading it took: 7.58 seconds.
> file # 592 (253560 bytes), reading it took: 7.90 seconds.
> file # 593 (253560 bytes), reading it took: 8.78 seconds.
> file # 594 (253560 bytes), reading it took: 8.01 seconds.
> file # 595 (253560 bytes), reading it took: 7.47 seconds.
> file # 596 (253560 bytes), reading it took: 11.52 seconds.
> file # 597 (253560 bytes), reading it took: 10.33 seconds.
> file # 598 (253560 bytes), reading it took: 8.56 seconds.
> file # 599 (253560 bytes), reading it took: 7.58 seconds.

Did you capture the trace of the long delays in the read test case? It
can be two things, at least. One is that each little read takes much
longer than it should, the other is that we get stuck waiting on a dirty
page and hence that slows down the reads a lot.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
  2009-03-26  9:09                                 ` Ingo Molnar
  2009-03-26 11:08                                 ` Jens Axboe
@ 2009-03-26 11:37                                 ` Theodore Tso
  2009-03-26 12:44                                   ` Ingo Molnar
                                                     ` (3 more replies)
  2009-03-26 12:22                                 ` Pekka Enberg
                                                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-26 11:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Ingo,

Interesting.  I wonder if the problem is the journal is cycling fast
enough that it is checkpointing all the time.  If so, it could be that
a bigger-sized journal might help.  Can you try this as an experiment?
Mount the filesystem using ext4, with the mount option nodelalloc.
With a filesystem formatted as ext3, and with delayed allocation
disabled, it should behave mostly the same as ext3; try and make sure
you're still seeing the same problems.

Then could you grab /proc/fs/jbd2/<dev>:8/history and
/proc/fs/jbd2/<dev>:8/info while running your test workload?

Also, can you send me the output of "dumpe2fs -h /dev/sdXX | grep Journal"?

> Oh, and while at it - also a job control complaint. I tried to 
> Ctrl-C the above script:
> 
> I had to hit Ctrl-C numerous times before Bash would honor it. This 
> too is a very common thing on large SMP systems.

Well, the script you sent runs the compile in the background.  It did:

>   while :; do
>     date
>     make mrproper      2>/dev/null >/dev/null
>     make defconfig     2>/dev/null >/dev/null
>     make -j32 bzImage  2>/dev/null >/dev/null
>   done &
         ^^

So there would have been nothing to ^C; I assume you were running this
with a variant that didn't have the ampersand, which would have run
the whole shell pipeline in a detached background process?

In any case, the workaround for this is to ^Z the script, and then
"kill %" it.

I'm pretty sure this is actually a bash problem.  When you send a
Ctrl-C, it sends a SIGINT to all of the members of the tty's
foreground process group.  Under some circumstances, bash sets the
signal handler for SIGINT to be SIG_IGN.  I haven't looked at this
super closely (it would require diving into the bash sources), but you
can see it if you attach an strace to the bash shell driving a script
such as

#!/bin/bash

while /bin/true; do
      date
      sleep 60
done &

If you do a "ps axo pid,ppid,pgrp,args", you'll see that the bash and
the sleep 60 have the same process group.  If you emulate hitting ^C
by sending a SIGINT to pid of the shell, you'll see that it ignores
it.  Sleep also seems to be ignoring the SIGINT when run in the
background; but it does honor SIGINT in the foreground --- I didn't
have time to dig into that.

In any case, bash appears to SIG_IGN the INT signal if there is a child
process running, and only takes the ^C if bash itself is actually
"running" the shell script.  For example, if you run the command
"date;sleep 10;date;sleep 10;date", the ^C only interrupts the sleep
command.  It doesn't stop the series of commands which bash is
running.

						- Ted


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
                                                   ` (2 preceding siblings ...)
  2009-03-26 11:37                                 ` Theodore Tso
@ 2009-03-26 12:22                                 ` Pekka Enberg
  2009-03-26 12:23                                   ` Pekka Enberg
  2009-03-26 14:38                                 ` Andrew Morton
                                                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 664+ messages in thread
From: Pekka Enberg @ 2009-03-26 12:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Theodore Tso, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Hi Ingo,

On Thu, Mar 26, 2009 at 11:06 AM, Ingo Molnar <mingo@elte.hu> wrote:
> Btw., i had this test going on that box while i wrote some simple
> scripts in Vim - and it was a horrible experience. The worst wait
> was well above one minute - Vim just hung there indefinitely. Not
> even Ctrl-Z was possible. I captured one such wait, it was hanging
> right here:

Just a data point: I've seen this exact same time for a long time (1-2
years) too even with stock distribution kernels. Never bothered to
investigate it though.

                        Pekka

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 12:22                                 ` Pekka Enberg
@ 2009-03-26 12:23                                   ` Pekka Enberg
  0 siblings, 0 replies; 664+ messages in thread
From: Pekka Enberg @ 2009-03-26 12:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Theodore Tso, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, Mar 26, 2009 at 11:06 AM, Ingo Molnar <mingo@elte.hu> wrote:
>> Btw., i had this test going on that box while i wrote some simple
>> scripts in Vim - and it was a horrible experience. The worst wait
>> was well above one minute - Vim just hung there indefinitely. Not
>> even Ctrl-Z was possible. I captured one such wait, it was hanging
>> right here:

On Thu, Mar 26, 2009 at 2:22 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> Just a data point: I've seen this exact same time for a long time (1-2

s/same time/same thing/

> years) too even with stock distribution kernels. Never bothered to
> investigate it though.

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 11:37                                 ` Theodore Tso
@ 2009-03-26 12:44                                   ` Ingo Molnar
  2009-03-26 12:46                                   ` Ingo Molnar
                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-26 12:44 UTC (permalink / raw)
  To: Theodore Tso, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 2139 bytes --]


* Theodore Tso <tytso@mit.edu> wrote:

> >   while :; do
> >     date
> >     make mrproper      2>/dev/null >/dev/null
> >     make defconfig     2>/dev/null >/dev/null
> >     make -j32 bzImage  2>/dev/null >/dev/null
> >   done &
>          ^^
> 
> So there would have been nothing to ^C; I assume you were running 
> this with a variant that didn't have the ampersand, which would 
> have run the whole shell pipeline in a detached background 
> process?

That was just the example - the real script did not go into the 
background so it was Ctrl-C-able. I've attached it.

> In any case, the workaround for this is to ^Z the script, and then 
> "kill %" it.
> 
> I'm pretty sure this is actually a bash problem.  When you send a 
> Ctrl-C, it sends a SIGINT to all of the members of the tty's 
> foreground process group.  Under some circumstances, bash sets the 
> signal handler for SIGINT to be SIG_IGN.  I haven't looked at this 
> super closely (it would require diving into the bash sources), but 
> you can see it if you attach an strace to the bash shell driving a 
> script such as
> 
> #!/bin/bash
> 
> while /bin/true; do
>       date
>       sleep 60
> done &
> 
> If you do a "ps axo pid,ppid,pgrp,args", you'll see that the bash 
> and the sleep 60 have the same process group.  If you emulate 
> hitting ^C by sending a SIGINT to pid of the shell, you'll see 
> that it ignores it.  Sleep also seems to be ignoring the SIGINT 
> when run in the background; but it does honor SIGINT in the 
> foreground --- I didn't have time to dig into that.
> 
> In any case, bash appears to SIG_IGN the INT signal if there is a 
> child process running, and only takes the ^C if bash itself is 
> actually "running" the shell script.  For example, if you run the 
> command "date;sleep 10;date;sleep 10;date", the ^C only interrupts 
> the sleep command.  It doesn't stop the series of commands which 
> bash is running.

It happens all the time - and it does look like a Bash bug. I 
reported it to the Bash maintainer one or two years ago. He said
he does not see it as he's using MacOS X. Can dig into archives if 
needed.
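
The ignored-SIGINT behaviour described above can be reproduced outside bash. What follows is a minimal Python sketch, assuming a POSIX system. The helper name survives_sigint is made up for illustration, and the sketch does not model what bash does internally; it only shows what a SIG_IGN disposition looks like from the sender's side.

```python
import signal
import subprocess
import sys

# Child program: set the SIGINT disposition, announce readiness, then sleep.
CHILD = """
import signal, time
signal.signal(signal.SIGINT, signal.{disposition})
print("ready", flush=True)
time.sleep(30)
"""


def survives_sigint(ignore: bool) -> bool:
    """Return True if a child process is still alive after receiving SIGINT."""
    disposition = "SIG_IGN" if ignore else "SIG_DFL"
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD.format(disposition=disposition)],
        stdout=subprocess.PIPE,
    )
    try:
        child.stdout.readline()           # wait until the disposition is set
        child.send_signal(signal.SIGINT)  # emulate ^C aimed at this one pid
        try:
            child.wait(timeout=2)
            return False                  # default action: child terminated
        except subprocess.TimeoutExpired:
            return True                   # SIG_IGN: the signal was discarded
    finally:
        child.kill()                      # no-op if the child already exited
        child.wait()
        child.stdout.close()
```

Calling survives_sigint(True) corresponds to the "bash ignores it" case; survives_sigint(False) corresponds to a process that keeps the default disposition and dies on ^C.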

	Ingo

[-- Attachment #2: compile-test --]
[-- Type: text/plain, Size: 321 bytes --]


  #
  # Cool down caches as much as possible:
  #
  sync
  echo 3 > /proc/sys/vm/drop_caches

  #
  # Kernel developer builds his kernel workload:
  #
  while :; do
    date
    make mrproper      2>/dev/null >/dev/null
    make defconfig     2>/dev/null >/dev/null
    make -j32 bzImage  2>/dev/null >/dev/null
  done


* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 11:37                                 ` Theodore Tso
  2009-03-26 12:44                                   ` Ingo Molnar
@ 2009-03-26 12:46                                   ` Ingo Molnar
  2009-03-26 14:03                                   ` Ingo Molnar
  2009-03-31 11:51                                   ` Neil Brown
  3 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-26 12:46 UTC (permalink / raw)
  To: Theodore Tso, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath


* Theodore Tso <tytso@mit.edu> wrote:

> Also, can you send me the output of "dumpe2fs -h /dev/sdXX | grep Journal"?

Sure:

 [root@aldebaran ~]# df
 Filesystem           1K-blocks      Used Available Use% Mounted on
 /dev/sda1             40313964  11595520  26670560  31% /
 /dev/sda2            403160732  35165460 347515212  10% /home
 tmpfs                  6159096        48   6159048   1% /dev/shm

 [root@aldebaran ~]# dumpe2fs -h /dev/sda2 | grep Journal
 dumpe2fs 1.40.8 (13-Mar-2008)
 Journal inode:            8
 Journal backup:           inode blocks
 Journal size:             128M

Stock Fedora 9 release/install, updated, and booted to 2.6.29.

	Ingo

* Re: Linux 2.6.29
  2009-03-26  6:24                                         ` Jeff Garzik
@ 2009-03-26 12:49                                           ` Kyle Moffett
  0 siblings, 0 replies; 664+ messages in thread
From: Kyle Moffett @ 2009-03-26 12:49 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matthew Garrett, Theodore Tso, Christoph Hellwig,
	Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 2:24 AM, Jeff Garzik <jeff@garzik.org> wrote:
> Kyle Moffett wrote:
>> Really, I think virtually all of the database programs would be
>> perfectly happy with an "fsbarrier(fd, flags)" syscall, where if "fd"
>> points to a regular file or directory then it instructs the underlying
>> filesystem to do whatever internal barrier it supports, and if not
>> just fail with -ENOTSUPP (so you can fall back to fdatasync(), etc).
>> Perhaps "flags" would allow a "data" or "metadata" barrier, but if not
>> it's not a big issue.
>
> If you want a per-fd barrier call, there is always sync_file_range(2)

The issue is that sync_file_range doesn't seem to be documented to
have any inter-file barrier semantics.  Even then, from the manpage it
doesn't look like
write(fd)+sync_file_range(fd,SYNC_FILE_RANGE_WRITE)+write(fd) would
actually prevent the second write from occurring before the first has
actually hit disk (assuming both are within the specified range).
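
Without a barrier primitive, the only ordering guarantee available from userspace is to wait for the first write to become durable before issuing the second. A rough Python sketch of that fallback follows. ordered_writes and _write_all are hypothetical helper names, and the sketch uses fdatasync() rather than sync_file_range() or the proposed fsbarrier().

```python
import os

# fdatasync() is missing on some platforms; fall back to fsync() there.
_datasync = getattr(os, "fdatasync", os.fsync)


def _write_all(fd: int, data: bytes) -> None:
    """Write the whole buffer, looping over short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def ordered_writes(path: str, first: bytes, second: bytes) -> None:
    """Write `first`, wait until it is on stable storage, then write `second`."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, first)
        _datasync(fd)   # blocks until the first write's data is flushed
        _write_all(fd, second)
        _datasync(fd)
    finally:
        os.close(fd)
```

The cost is that the caller blocks for a full flush between the two writes, which is exactly the latency a non-blocking barrier call would avoid.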

Cheers,
Kyle Moffett

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 11:37                                 ` Theodore Tso
  2009-03-26 12:44                                   ` Ingo Molnar
  2009-03-26 12:46                                   ` Ingo Molnar
@ 2009-03-26 14:03                                   ` Ingo Molnar
  2009-03-26 14:13                                     ` Ingo Molnar
                                                       ` (3 more replies)
  2009-03-31 11:51                                   ` Neil Brown
  3 siblings, 4 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-26 14:03 UTC (permalink / raw)
  To: Theodore Tso, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath


* Theodore Tso <tytso@mit.edu> wrote:

> Ingo,
> 
> Interesting.  I wonder if the problem is the journal is cycling 
> fast enough that it is checkpointing all the time.  If so, it 
> could be that a bigger-sized journal might help.  Can you try this 
> as an experiment? Mount the filesystem using ext4, with the mount 
> option nodelalloc. With a filesystem formatted as ext3, and with 
> delayed allocation disabled, it should behave mostly the same as 
> ext3; try and make sure you're still seeing the same problems.
> 
> Then could you grab /proc/fs/jbd2/<dev>:8/history and
> /proc/fs/jbd2/<dev>:8/info while running your test workload?

i tried it:

  /dev/sda2 on /home type ext4 (rw,nodelalloc)

I still see similarly bad latencies in Vim:

aldebaran:~> cat /proc/10227/stack
[<ffffffff80370cad>] jbd2_log_wait_commit+0xbd/0x110
[<ffffffff8036bc70>] jbd2_journal_stop+0x1f3/0x221
[<ffffffff8036ccb0>] jbd2_journal_force_commit+0x28/0x2c
[<ffffffff80352660>] ext4_force_commit+0x2e/0x34
[<ffffffff80346682>] ext4_write_inode+0x3e/0x44
[<ffffffff802eb941>] __sync_single_inode+0xc1/0x2ad
[<ffffffff802ebc7a>] __writeback_single_inode+0x14d/0x15a
[<ffffffff802ebcb0>] sync_inode+0x29/0x34
[<ffffffff80343e16>] ext4_sync_file+0xf6/0x138
[<ffffffff802eef21>] vfs_fsync+0x78/0xaf
[<ffffffff802eef8f>] do_fsync+0x37/0x4d
[<ffffffff802eefcc>] sys_fsync+0x10/0x14
[<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Vim is still almost unusable during this workload - even if i don't 
write out the source file and just use it interactively to edit it.

The read-test is somewhat better. There are occasional blips of 4-5 
seconds:

 file # 928 (253560 bytes), reading it took: 0.76 seconds.
 file # 929 (253560 bytes), reading it took: 3.98 seconds.
 file # 930 (253560 bytes), reading it took: 3.45 seconds.
 file # 931 (253560 bytes), reading it took: 0.04 seconds.
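
A read-latency loop of this kind is easy to reproduce. Here is a minimal Python sketch, which is not the script that produced the numbers above. The function names are invented, the files are assumed to be named 0, 1, 2, ... inside one directory, and nothing here drops caches, so the reads are only cache-cold if the files have not been touched recently.

```python
import os
import time


def time_reads(directory: str, count: int, chunk: int = 1 << 16):
    """Read files named 0..count-1 and return (index, bytes, seconds) tuples."""
    results = []
    for i in range(count):
        path = os.path.join(directory, str(i))
        start = time.monotonic()
        size = 0
        # buffering=0 keeps Python's own buffering out of the measurement.
        with open(path, "rb", buffering=0) as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                size += len(block)
        results.append((i, size, time.monotonic() - start))
    return results


def report(results):
    """Format results in the same shape as the output quoted above."""
    return ["file #%4d (%d bytes), reading it took: %.2f seconds." % r
            for r in results]
```

Printing "\n".join(report(time_reads("/some/dir", 1000))) gives one line per file; the directory path is a placeholder.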

I have also written a 'vim open' test which does vim -c q, i.e. it 
just opens a source file and closes it without writing the file. 
That too takes a lot of time:

file #   0 (253560 bytes), Vim-opening it took: 2.04 seconds.
file #   1 (253560 bytes), Vim-opening it took: 2.39 seconds.
file #   2 (253560 bytes), Vim-opening it took: 2.03 seconds.
file #   3 (253560 bytes), Vim-opening it took: 2.81 seconds.
file #   4 (253560 bytes), Vim-opening it took: 2.11 seconds.
file #   5 (253560 bytes), Vim-opening it took: 2.44 seconds.
file #   6 (253560 bytes), Vim-opening it took: 2.04 seconds.
file #   7 (253560 bytes), Vim-opening it took: 3.59 seconds.
file #   8 (253560 bytes), Vim-opening it took: 2.06 seconds.
file #   9 (253560 bytes), Vim-opening it took: 3.26 seconds.
file #  10 (253560 bytes), Vim-opening it took: 2.04 seconds.
file #  11 (253560 bytes), Vim-opening it took: 2.38 seconds.
file #  12 (253560 bytes), Vim-opening it took: 2.04 seconds.
file #  13 (253560 bytes), Vim-opening it took: 3.05 seconds.

Here's a few snapshots of Vim waiting spots:

aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
[<ffffffff8036c1ae>] do_get_write_access+0x22b/0x452
[<ffffffff8036c3fc>] jbd2_journal_get_write_access+0x27/0x38
[<ffffffff8035aa8c>] __ext4_journal_get_write_access+0x51/0x59
[<ffffffff80346f30>] ext4_reserve_inode_write+0x3d/0x79
[<ffffffff80346f9f>] ext4_mark_inode_dirty+0x33/0x187
[<ffffffff8034724e>] ext4_dirty_inode+0x6a/0x9f
[<ffffffff802ec4eb>] __mark_inode_dirty+0x38/0x199
[<ffffffff802e27f5>] touch_atime+0xf6/0x101
[<ffffffff8029cc83>] do_generic_file_read+0x37c/0x3c7
[<ffffffff8029d770>] generic_file_aio_read+0x15b/0x197
[<ffffffff802d0816>] do_sync_read+0xec/0x132
[<ffffffff802d11de>] vfs_read+0xb0/0x139
[<ffffffff802d1335>] sys_read+0x4c/0x74
[<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
[<ffffffff8029c0ed>] sync_page+0x41/0x45
[<ffffffff8029c274>] wait_on_page_bit+0x73/0x7a
[<ffffffff802a5a76>] truncate_inode_pages_range+0x2f6/0x37b
[<ffffffff802a5b0d>] truncate_inode_pages+0x12/0x15
[<ffffffff8034b97b>] ext4_delete_inode+0x6a/0x25f
[<ffffffff802e378e>] generic_delete_inode+0xe7/0x174
[<ffffffff802e382f>] generic_drop_inode+0x14/0x1d
[<ffffffff802e2866>] iput+0x66/0x6a
[<ffffffff802db889>] do_unlinkat+0x107/0x15d
[<ffffffff802db8f5>] sys_unlink+0x16/0x18
[<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
[<ffffffff8036c1ae>] do_get_write_access+0x22b/0x452
[<ffffffff8036c3fc>] jbd2_journal_get_write_access+0x27/0x38
[<ffffffff8035aa8c>] __ext4_journal_get_write_access+0x51/0x59
[<ffffffff80346f30>] ext4_reserve_inode_write+0x3d/0x79
[<ffffffff80346f9f>] ext4_mark_inode_dirty+0x33/0x187
[<ffffffff8034724e>] ext4_dirty_inode+0x6a/0x9f
[<ffffffff802ec4eb>] __mark_inode_dirty+0x38/0x199
[<ffffffff802e27f5>] touch_atime+0xf6/0x101
[<ffffffff8029cc83>] do_generic_file_read+0x37c/0x3c7
[<ffffffff8029d770>] generic_file_aio_read+0x15b/0x197
[<ffffffff802d0816>] do_sync_read+0xec/0x132
[<ffffffff802d11de>] vfs_read+0xb0/0x139
[<ffffffff802d1335>] sys_read+0x4c/0x74
[<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
[<ffffffff8036c1ae>] do_get_write_access+0x22b/0x452
[<ffffffff8036c3fc>] jbd2_journal_get_write_access+0x27/0x38
[<ffffffff8035aa8c>] __ext4_journal_get_write_access+0x51/0x59
[<ffffffff80346f30>] ext4_reserve_inode_write+0x3d/0x79
[<ffffffff80346f9f>] ext4_mark_inode_dirty+0x33/0x187
[<ffffffff8034724e>] ext4_dirty_inode+0x6a/0x9f
[<ffffffff802ec4eb>] __mark_inode_dirty+0x38/0x199
[<ffffffff802e27f5>] touch_atime+0xf6/0x101
[<ffffffff8029cc83>] do_generic_file_read+0x37c/0x3c7
[<ffffffff8029d770>] generic_file_aio_read+0x15b/0x197
[<ffffffff802d0816>] do_sync_read+0xec/0x132
[<ffffffff802d11de>] vfs_read+0xb0/0x139
[<ffffffff802d1335>] sys_read+0x4c/0x74
[<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
[<ffffffff8036c1ae>] do_get_write_access+0x22b/0x452
[<ffffffff8036c3fc>] jbd2_journal_get_write_access+0x27/0x38
[<ffffffff8035aa8c>] __ext4_journal_get_write_access+0x51/0x59
[<ffffffff80346f30>] ext4_reserve_inode_write+0x3d/0x79
[<ffffffff80346f9f>] ext4_mark_inode_dirty+0x33/0x187
[<ffffffff8034724e>] ext4_dirty_inode+0x6a/0x9f
[<ffffffff802ec4eb>] __mark_inode_dirty+0x38/0x199
[<ffffffff802e27f5>] touch_atime+0xf6/0x101
[<ffffffff8029cc83>] do_generic_file_read+0x37c/0x3c7
[<ffffffff8029d770>] generic_file_aio_read+0x15b/0x197
[<ffffffff802d0816>] do_sync_read+0xec/0x132
[<ffffffff802d11de>] vfs_read+0xb0/0x139
[<ffffffff802d1335>] sys_read+0x4c/0x74
[<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

That's in good part down to atime update latencies. We still appear 
to default to atime enabled in ext4.

That's stupid - only around 0.01% of all Linux systems rely on 
atime - and even those that rely on it would be well served by 
relatime. Why aren't the relatime patches upstream? Why isn't it the 
default? They have been submitted several times.

Atime in its current mandatory do-a-write-for-every-read form is a 
stupid relic and we have been paying the fool's tax for it for the 
past 10 years.
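
For readers unfamiliar with relatime, the rule is small enough to state in code. This is a Python sketch of the decision, assuming the usual relatime semantics (atime is only written when it is not newer than mtime or ctime). The function and parameter names are illustrative, not kernel identifiers, and the optional max_age argument models a "refresh once per day" variant as an assumption rather than as merged behaviour.

```python
from typing import Optional


def relatime_needs_update(atime: float, mtime: float, ctime: float,
                          now: Optional[float] = None,
                          max_age: Optional[float] = None) -> bool:
    """Decide whether a read should dirty the inode to update atime."""
    # The file was modified or changed since it was last read:
    # record the new access so "read since last write?" checks keep working.
    if mtime >= atime or ctime >= atime:
        return True
    # Optional variant: refresh atime anyway once it is older than max_age.
    if now is not None and max_age is not None and now - atime >= max_age:
        return True
    # Otherwise the read stays a pure read: no inode write, no journal access.
    return False
```

With strict atime every read takes the first branch unconditionally, which is why the touch_atime() frames show up in the stack traces above.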

	Ingo

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 14:03                                   ` Ingo Molnar
@ 2009-03-26 14:13                                     ` Ingo Molnar
  2009-03-26 14:30                                     ` Andrew Morton
                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-26 14:13 UTC (permalink / raw)
  To: Theodore Tso, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath


> > Then could you grab /proc/fs/jbd2/<dev>:8/history and 
> > /proc/fs/jbd2/<dev>:8/info while running your test workload?

grabbed them after the tests. (not much else ran before and after 
the tests) Here they are:

R/C  tid   wait  run   lock  flush log   hndls  block inlog ctime write drop  close
R    642267 0     5000  0     84    3288  76104  560   562  
R    642268 0     5000  956   2024  5760  68891  491   493  
R    642269 956   7788  8     5104  6696  182270 667   669  
R    642270 8     11800 216   7000  6816  186159 834   837  
R    642271 60    13816 0     0     492   45115  2162  2169 
R    642272 0     5000  0     0     2144  44278  1266  1270 
R    642273 0     5000  0     80    3144  73604  444   446  
R    642274 0     5000  0     276   3120  71741  488   490  
R    642275 0     5000  0     288   3608  87334  526   528  
R    642276 0     5000  0     112   2992  83061  512   514  
R    642277 0     5000  0     84    5892  75029  468   470  
R    642278 0     5976  0     848   8564  71693  483   485  
R    642279 0     9412  340   5432  7104  167415 664   666  
R    642280 340   12536 0     8764  3820  270409 906   909  
R    642281 0     12584 0     0     576   38603  2175  2182 
R    642282 0     5000  0     16    2620  51638  1275  1279 
R    642283 0     5000  0     16    2364  58962  376   378  
R    642284 0     5000  0     56    2812  66644  442   444  
R    642285 0     5000  0     64    2744  61323  479   481  
R    642286 0     5000  0     16    2328  61109  439   441  
R    642287 0     5000  0     40    2752  69227  471   473  
R    642288 0     5000  0     20    2536  60836  454   456  
R    642289 0     5000  0     16    2612  63580  440   442  
R    642290 0     5000  0     48    2528  72629  463   465  
R    642291 0     5000  0     68    2848  75262  498   500  
R    642292 0     5000  0     60    2688  77164  468   470  
R    642293 0     5000  0     0     2188  60922  458   460  
R    642294 0     5000  0     348   3124  79928  528   530  
R    642295 0     5100  0     1896  3128  62695  672   674  
R    642296 0     5024  0     8     4840  17110  90    91   
R    642297 0     5000  0     0     816   14232  1534  1539 
R    642298 0     5000  0     0     2636  52309  2070  2077 
R    642299 0     5000  0     60    2936  67005  375   377  
R    642300 0     5000  0     0     2064  52821  374   376  
R    642301 0     5000  0     228   3344  85321  539   541  
R    642302 0     5000  0     24    5360  54421  458   460  
R    642303 0     5384  52    1388  6548  66593  452   454  
R    642304 52    7936  876   7316  3832  192877 640   642  
R    642305 876   11152 0     5620  3656  262638 821   824  
R    642306 0     9276  0     6172  3768  125379 806   809  
R    642307 0     9920  0     0     6256  11091  702   705  
R    642308 0     5636  0     0     4976  266    119   120  
R    642309 0     5004  0     0     432   11035  1894  1900 
R    642310 0     5000  0     0     2608  43443  1270  1274 
R    642311 0     5000  0     36    2204  55368  333   334  
R    642312 0     5000  0     224   3180  77649  476   478  
R    642313 0     5000  0     20    2496  65989  431   433  
R    642314 0     5000  0     36    2512  71443  466   468  
R    642315 0     5000  956   1796  3916  62650  460   462  
R    642316 956   5712  0     1608  3676  136304 581   583  
R    642317 0     5284  0     1732  3704  123922 572   574  
R    642318 0     5436  0     2032  3652  133710 610   612  
R    642319 0     5684  0     1496  3652  129017 622   624  
R    642320 0     5148  0     640   3304  36529  602   604  
R    642321 0     5004  0     580   5404  23212  216   217  
R    642322 0     5980  0     0     272   4878   149   150  
R    642323 0     5000  0     0     1408  32432  2803  2812 
R    642324 0     5000  0     1184  3416  94751  420   422  
R    642325 0     5000  0     592   3260  94208  502   504  
R    642326 0     5000  0     92    3648  88419  486   488  
R    642327 0     5000  0     80    3052  76222  489   491  
R    642328 0     5000  0     0     4460  60005  429   431  
R    642329 0     5000  1108  2408  6260  77172  426   428  
R    642330 1108  8668  4     4992  3356  166166 622   624  
R    642331 0     8340  0     0     120   75     62    63   
R    642332 0     6984  0     0     68    51     14    15   
R    642333 0     6072  0     0     72    35     14    15   
R    642334 0     6072  0     0     80    31     12    13   
R    642335 0     5732  0     0     84    49     15    16   
R    642336 0     5684  0     0     68    31     12    13   
R    642337 0     5612  0     0     72    36     14    15   
R    642338 0     5636  0     0     68    35     15    16   
R    642339 0     5564  0     0     48    8      3     4    
R    642340 0     6224  0     0     4136  7      8     9    
R    642341 0     5596  0     0     100   11     13    14   
R    642342 0     5660  0     0     156   31     11    12   
R    642343 0     5752  0     0     76    24     12    13   
R    642344 0     5088  0     0     60    12     4     5    
R    642345 0     5032  0     0     40    8      3     4    
R    642346 0     5440  0     0     36    16     4     5    
R    642347 0     8616  0     0     44    13     4     5    
R    642348 0     5284  0     0     44    7      4     5    
R    642349 0     5444  0     0     56    8      4     5    
R    642350 0     5292  0     0     36    8      4     5    
R    642351 0     5140  0     0     44    4      3     4    
R    642352 0     5452  0     0     80    26     7     8    
R    642353 0     11144 0     0     108   5      3     4    
R    642354 0     5692  0     0     36    5      3     4    
R    642355 0     5520  0     0     40    4      3     4    
R    642356 0     5376  0     0     40    5      3     4    
R    642357 0     5228  0     0     44    4      3     4    
R    642358 0     5080  0     0     44    5      3     4    
R    642359 0     17148 0     0     48    6      3     4    
R    642360 0     18148 0     0     44    4      3     4    
R    642361 0     5064  0     0     64    12     4     5    
R    642362 0     5560  0     0     64    33     13    14   
R    642363 0     5228  0     0     36    7      4     5    
R    642364 0     5096  0     0     40    3      3     4    
R    642365 0     8136  0     0     64    19     8     9    
R    642366 0     9140  0     0     64    9      4     5    

312 transaction, each upto 8192 blocks
average: 
  52ms waiting for transaction
  5692ms running transaction
  52ms transaction was being locked
  728ms flushing data (in ordered mode)
  2748ms logging transaction
  51724us average transaction commit time
  61570 handles per transaction
  565 blocks per transaction
  567 logged blocks per transaction

(Hopefully i captured them the right way - should i have done this 
while the test was going on?)
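
To sanity-check a dump like this, the per-transaction rows can be averaged directly. The following is a minimal Python sketch. It assumes that 'R' rows carry the first nine numeric columns named in the header line (tid, wait, run, lock, flush, log, hndls, block, inlog), and the helper names are invented for illustration.

```python
from statistics import mean

# Columns present on "R" rows, in the order given by the header line.
R_COLUMNS = ("tid", "wait", "run", "lock", "flush", "log",
             "hndls", "block", "inlog")


def parse_history(text):
    """Turn the 'R' rows of a jbd2 history dump into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, blank lines and anything that is not an R row.
        if len(fields) != len(R_COLUMNS) + 1 or fields[0] != "R":
            continue
        rows.append(dict(zip(R_COLUMNS, map(int, fields[1:]))))
    return rows


def column_averages(rows):
    """Average every column except the transaction id."""
    return {name: mean(row[name] for row in rows)
            for name in R_COLUMNS if name != "tid"}
```

Only the rows still present in the history dump are averaged, so the result need not match the averages in the info summary, which covers all 312 transactions.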

	Ingo

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29)
  2009-03-26  2:31                                       ` Eric Sandeen
@ 2009-03-26 14:19                                         ` Ric Wheeler
  0 siblings, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-26 14:19 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On 03/25/2009 10:31 PM, Eric Sandeen wrote:
> Jeff Garzik wrote:
>    
>> Eric Sandeen wrote:
>>      
>
>    
>>> What about when you're running over a big raid device with
>>> battery-backed cache, and you trust the cache as much as the
>>> disks.  Wouldn't this unconditional cache flush be painful there on any
>>> of the callers even if they're rare?  (fs unmounts, freezes, unmounts,
>>> etc?  Or a fat filesystem on that device doing an fsync?)
>>>        
>> What exactly do you think sync_blockdev() does?  :)
>>      
>
> It used to push os cached data to the storage.  Now it tells the storage
> to flush cache too (with your patch).  This seems fine in general,
> although it's not a panacea for all the various data integrity issues
> that are being tossed about in this thread.  :)
>
>    
>> It is used right before a volume goes away.  If that's not a time to
>> flush the cache, I dunno what is.
>>
>> The _whole purpose_ of sync_blockdev() is to push out the data to
>> permanent storage.  Look at the users -- unmount volume, journal close,
>> etc.  Things that are OK to occur after those points include: power off,
>> device unplug, etc.
>>      
>
> Sure.  But I was thinking about enterprise raids with battery backup
> which may last for days.
>
> But, ok, I wasn't thinking quite right about the unmount situations etc;
> even on enterprise raids like this, flushing things out on unmount makes
> sense in the case where you lose power post-unmount and can't restore
> power before the battery backup dies.
>    

Enterprise raid systems don't have this issue. They all have sufficient 
battery power to safely destage the volatile cache to persistent storage 
on power outage (i.e., they keep enough drives spinning and so on to 
empty the cache).

Hardware RAID cards that you have in some servers (not external RAID 
arrays) would have this issue though and are probably fairly common.

> I also wondered if a cache flush on one lun issues a cache flush for the
> entire controller, or just for that lun.  Hopefully the latter, in which
> case it's not as big a deal.
>    

Probably just for that LUN - LUNs are usually independent in almost all 
ways. Firmware of course could do anything; I would assume that most 
high end arrays would probably ignore a flush command in any case.

>    
>> A secondary purpose of sync_blockdev() is as a hack, for simple/ancient
>> bdev-based filesystems that do not wish to bother with barriers and all
>> the associated complexity of tracking which writes do/do not need flushing.
>>      
>
> Yep, I get that and it seems reasonable except ....
>
>    
>>> xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
>>> not enabled, I think for that reason...
>>>        
>> Enabling barriers causes slowdowns far greater than that of simply
>> causing fsync(2) to trigger FLUSH CACHE, because barriers imply FLUSH
>> CACHE issuance for all in-kernel filesystem journalled/atomic
>> transactions, in addition to whatever syscalls userspace is issuing.
>>
>> The number of FLUSH CACHES w/ barriers is orders of magnitude larger
>> than the number of fsync/fdatasync calls.
>>      
>
> I understand all that.
>
> My point is that the above filesystems (xfs, reiserfs, ext4) skip the
> blkdev flush on fsync when barriers are explicitly disabled.
>
> They do this because if an admin disables barriers, they are trusting
> that the write cache is nonvolatile and will be able to destage fully
> even if external power is lost for some time.
>
> In that case you don't need a blkdev_issue_flush on fsync either (or are
> at least willing to live with the diminished risk, thanks to the battery
> backup), and on xfs, ext4 etc you can turn it off (it goes away w/ the
> barriers off setting).  With this change to the simple generic fsync
> path, you can't turn it off for those filesystems that use it for fsync.
>
> But I suppose it's rare that anybody ever uses a filesystem which uses
> this generic sync method on any sort of interesting storage like I'm
> talking about, and it's not a big deal...  (or maybe that interesting
> storage just ignores cache flushes anyway, I dunno).
>
> My main concerns were that these extra cache flushes for fsync aren't
> tunable, and that flushes on one lun might affect other luns.  I guess
> I've talked myself out of those concerns in a couple different ways now.  ;)
>
> -Eric
>    

I do enthusiastically agree that we should not be doing both barriers and 
the blkdev flush for file systems that do barriers correctly.

ric


* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 11:08                                 ` Jens Axboe
@ 2009-03-26 14:27                                   ` Arjan van de Ven
  2009-03-26 14:36                                     ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Arjan van de Ven @ 2009-03-26 14:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Jan Kara, Linus Torvalds, Theodore Tso,
	Andrew Morton, Alan Cox, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Thu, 26 Mar 2009 12:08:15 +0100
Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> Did you capture the trace of the long delays in the read test case? It
> can be two things, at least. One is that each little read takes much
> longer than it should, the other is that we get stuck waiting on a
> dirty page and hence that slows down the reads a lot.
> 

would be interesting to run latencytop during such runs...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 14:03                                   ` Ingo Molnar
  2009-03-26 14:13                                     ` Ingo Molnar
@ 2009-03-26 14:30                                     ` Andrew Morton
  2009-03-26 15:32                                       ` relatime: update once per day patches (was: ext3 IO latency measurements) Frans Pop
  2009-03-26 14:47                                     ` ext3 IO latency measurements (was: Linux 2.6.29) Theodore Tso
  2009-03-26 15:28                                     ` Theodore Tso
  3 siblings, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-03-26 14:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Theodore Tso, Jan Kara, Linus Torvalds, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, 26 Mar 2009 15:03:12 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> Why aren't the relatime patches upstream?

They have been for ages.

> Why isn't it the default?

That would have been a non-backward-compatible change.

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 14:27                                   ` Arjan van de Ven
@ 2009-03-26 14:36                                     ` Jens Axboe
  2009-03-26 14:49                                       ` Arjan van de Ven
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-26 14:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Jan Kara, Linus Torvalds, Theodore Tso,
	Andrew Morton, Alan Cox, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Thu, Mar 26 2009, Arjan van de Ven wrote:
> On Thu, 26 Mar 2009 12:08:15 +0100
> Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > Did you capture the trace of the long delays in the read test case? It
> > can be two things, at least. One is that each little read takes much
> > longer than it should, the other is that we get stuck waiting on a
> > dirty page and hence that slows down the reads a lot.
> > 
> 
> would be interesting to run latencytop during such runs...

Things just drown in the noise in there sometimes, is my experience...
And disappear. Perhaps I'm just not very good at reading the output, but
it was never very useful for me. I know, not a very useful complaint,
I'll try and use it again and come up with something more productive :-)

But in this case, I bet it's also the atime updates. If Ingo turned
those off, the read results would likely be a lot better (and more
consistent).

-- 
Jens Axboe


* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
                                                   ` (3 preceding siblings ...)
  2009-03-26 12:22                                 ` Pekka Enberg
@ 2009-03-26 14:38                                 ` Andrew Morton
  2009-03-26 18:11                                 ` Jan Kara
  2009-04-09 21:59                                 ` updated: ext3 IO latency measurements on v2.6.30-rc1 Ingo Molnar
  6 siblings, 0 replies; 664+ messages in thread
From: Andrew Morton @ 2009-03-26 14:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Theodore Tso, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, 26 Mar 2009 10:06:30 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> And it's not just sys_fsync(). The script i wrote tests file read 
> latencies. I have created 1000 files with the same size (all copies 
> of kernel/sched.c ;-), and tested their cache-cold plain-cat 
> performance via:
> 
>   for ((i=0;i<1000;i++)); do
>     printf "file #%4d, plain reading it took: " $i
>     /usr/bin/time -f "%e seconds."  cat $i >/dev/null
>   done
> 
> I.e. plain, supposedly high-prio reads. The result is very common 
> hiccups in read latencies:
> 
> file # 579 (253560 bytes), reading it took: 0.08 seconds.
> file # 580 (253560 bytes), reading it took: 0.05 seconds.
> file # 581 (253560 bytes), reading it took: 0.01 seconds.
> file # 582 (253560 bytes), reading it took: 0.01 seconds.
> file # 583 (253560 bytes), reading it took: 4.61 seconds.
> file # 584 (253560 bytes), reading it took: 1.29 seconds.
> file # 585 (253560 bytes), reading it took: 3.01 seconds.
> file # 586 (253560 bytes), reading it took: 7.74 seconds.
> file # 587 (253560 bytes), reading it took: 3.22 seconds.
> file # 588 (253560 bytes), reading it took: 0.05 seconds.
> file # 589 (253560 bytes), reading it took: 0.36 seconds.
> file # 590 (253560 bytes), reading it took: 7.39 seconds.
> file # 591 (253560 bytes), reading it took: 7.58 seconds.
> file # 592 (253560 bytes), reading it took: 7.90 seconds.
> file # 593 (253560 bytes), reading it took: 8.78 seconds.
> file # 594 (253560 bytes), reading it took: 8.01 seconds.
> file # 595 (253560 bytes), reading it took: 7.47 seconds.
> file # 596 (253560 bytes), reading it took: 11.52 seconds.
> file # 597 (253560 bytes), reading it took: 10.33 seconds.
> file # 598 (253560 bytes), reading it took: 8.56 seconds.
> file # 599 (253560 bytes), reading it took: 7.58 seconds.

(gets deja-vu feelings)

http://lkml.org/lkml/2003/2/21/10

Maybe you should be running a 2.5.61 kernel.
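
For anyone who wants to repeat the quoted per-file timing from C rather
than from the shell loop, the following is a minimal sketch. The name
timed_read() is made up for this note and is not part of any tool
mentioned in the thread. Unlike the quoted test it does nothing to make
the reads cache-cold (that would need a fresh mount or dropping the page
cache first), so it only illustrates the timing itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/*
 * Read the whole file at `path` and return the elapsed wall-clock time
 * in seconds, or -1.0 if the file cannot be opened or read.
 * If `bytes_out` is not NULL it receives the number of bytes read.
 * (Hypothetical helper, for illustration only.)
 */
double timed_read(const char *path, long *bytes_out)
{
    struct timespec start, end;
    char buf[65536];
    long total = 0;
    size_t n;
    int err;
    FILE *f;

    clock_gettime(CLOCK_MONOTONIC, &start);

    f = fopen(path, "rb");
    if (!f)
        return -1.0;

    /* Equivalent of "cat file >/dev/null": read and discard. */
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        total += (long)n;

    err = ferror(f);
    fclose(f);

    clock_gettime(CLOCK_MONOTONIC, &end);

    if (err)
        return -1.0;
    if (bytes_out)
        *bytes_out = total;

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling timed_read() once per file and printing the result gives the
same kind of per-file latency list as in the quoted output.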

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 14:03                                   ` Ingo Molnar
  2009-03-26 14:13                                     ` Ingo Molnar
  2009-03-26 14:30                                     ` Andrew Morton
@ 2009-03-26 14:47                                     ` Theodore Tso
  2009-03-26 16:20                                       ` Linus Torvalds
  2009-03-26 15:28                                     ` Theodore Tso
  3 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-26 14:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, Mar 26, 2009 at 03:03:12PM +0100, Ingo Molnar wrote:
> That's in good part atime update latencies. We still appear to 
> default to atime enabled in ext4.
> 
> That's stupid - only around 0.01% of all Linux systems rely on 
> atime - and even those who rely on it would be well served by 
> relatime. Why aren't the relatime patches upstream? Why isn't it the 
> default? They have been submitted several times.

The relatime patches are upstream.  Both noatime and relatime are
handled at the VFS layer, not at the per-filesystem level.  The reason
why it isn't the default is because of a desire for POSIX compliance,
I suspect.  Most distributions are putting relatime into /etc/fstab by
default, but we haven't changed the mount option.  It wouldn't be hard
to add an "atime" option to turn on atime updates, and make either
"noatime" or "relatime" the default.  This is a simple patch to
fs/namespace.c

> Atime in its current mandatory do-a-write-for-every-read form is a 
> stupid relic and we have been paying the fool's tax for it in the 
> past 10 years.

No argument here.  I use noatime, myself.  It actually saves a lot
more than relatime, and unless you are using mutt with local Maildir
delivery, relatime isn't really that helpful, and the benefit of
noatime is roughly double that of relatime vs normal atime update, in
my measurements:

http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-and-noatimerelatime/

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-26  8:57                                   ` Jens Axboe
@ 2009-03-26 14:47                                     ` Hugh Dickins
  2009-03-26 15:46                                       ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Hugh Dickins @ 2009-03-26 14:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, 26 Mar 2009, Jens Axboe wrote:
> On Wed, Mar 25 2009, Hugh Dickins wrote:
> > 
> > Tangential question, but am I right in thinking that BIO_RW_BARRIER
> > similarly bars across all partitions, whereas its WRITE_BARRIER and
> > DISCARD_BARRIER users would actually prefer it to apply to just one?
> 
> All the barriers refer to just that range which the barrier itself
> references.

Ah, thank you: then I had a fundamental misunderstanding of them,
and need to go away and work that out some more.

Though I didn't read it before asking, doesn't the I/O Barriers section
of Documentation/block/biodoc.txt give a very different impression?

> The problem with the full device flushes is implementation
> on the hardware side, since we can't do small range flushes. So it's not
> as-designed, but rather the best we can do...

Right, that part of it I did get.

Hugh

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 14:36                                     ` Jens Axboe
@ 2009-03-26 14:49                                       ` Arjan van de Ven
  0 siblings, 0 replies; 664+ messages in thread
From: Arjan van de Ven @ 2009-03-26 14:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Jan Kara, Linus Torvalds, Theodore Tso,
	Andrew Morton, Alan Cox, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Thu, 26 Mar 2009 15:36:18 +0100
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Thu, Mar 26 2009, Arjan van de Ven wrote:
> > On Thu, 26 Mar 2009 12:08:15 +0100
> > Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > Did you capture the trace of the long delays in the read test
> > > case? It can be two things, at least. One is that each little
> > > read takes much longer than it should, the other is that we get
> > > stuck waiting on a dirty page and hence that slows down the reads
> > > a lot.
> > > 
> > 
> > would be interesting to run latencytop during such runs...
> 
> Things just drown in the noise in there sometimes, is my experience...
> And disappear. Perhaps I'm just not very good at reading the output,
> but it was never very useful for me. I know, not a very useful
> complaint, I'll try and use it again and come up with something more
> productive :-)
> 
> But in this case, I bet it's also the atime updates. If Ingo turned
> those off, the read results would likely be a lot better (and
> consistent).

latencytop does identify those for sure.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 14:03                                   ` Ingo Molnar
                                                       ` (2 preceding siblings ...)
  2009-03-26 14:47                                     ` ext3 IO latency measurements (was: Linux 2.6.29) Theodore Tso
@ 2009-03-26 15:28                                     ` Theodore Tso
  2009-03-26 23:02                                       ` Ingo Molnar
  3 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-26 15:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, Mar 26, 2009 at 03:03:12PM +0100, Ingo Molnar wrote:
> 
> I still see similarly bad latencies in Vim:
> 

When you say "similarly bad", how many seconds were you seeing?  I
understand that from the user's perspective, the 120 seconds you saw
with ext3 isn't going to be that different from 15 seconds (which
seems to be the maximum commit time in the jbd2 history file you sent
me), but I'm curious if what you saw was just as bad with ext4, or was
it somewhat better (i.e., 120 seconds vs 15 or so).  Or were you also
seeing a net time to save the file using vim of around 120 seconds
with ext4?

Ext4 in nodelalloc mode is mostly similar to ext3, but it does have
some improvements, such as a slightly elevated I/O priority for
kjournald, and the ext4's writepage doesn't take the journal handle as
it does in ext3.  (That's why I was confused about Linus's assertion
about ext3 waiting on the journal; ext4 doesn't any more, and I had
ext4 on the brain.)  Unfortunately, we don't have the
/proc/fs/jbd/<dev>/history for ext3, so it would be interesting to
compare whether the vim save latencies were improved or not with ext4.
If they are, then it might be worth Jan's time to fix up ext3's
writepage to not try to request journal access if it's not needed.  It
might also be worth backporting ext4's slightly raised I/O priority
patch.

Another thing that's worth trying.  Suppose you use ionice to raise
the priority of kjournald to a real-time I/O priority (which is what
Arjan's patch does).  How much does that help?  Is it more or less
than what we're seeing with ext4's slightly raised I/O
priority?

And if we mount the filesystem noatime, does that change the results
significantly as far as Vim write latencies and the jbd2 history file?

> The read-test is somewhat better. There are occasional blips of 4-5 
> seconds:

Presumably these go away once we mount the filesystem noatime, right?

	   	   	    	      	       - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 14:30                                     ` Andrew Morton
@ 2009-03-26 15:32                                       ` Frans Pop
  2009-03-26 15:47                                         ` Andrew Morton
  0 siblings, 1 reply; 664+ messages in thread
From: Frans Pop @ 2009-03-26 15:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, tytso, jack, torvalds, alan, arjan, a.p.zijlstra, npiggin,
	jens.axboe, drees76, jesper, linux-kernel, oleg, roland

Andrew Morton wrote:
> On Thu, 26 Mar 2009 15:03:12 +0100 Ingo Molnar <mingo@elte.hu> wrote:
>> Why aren't the relatime patches upstream?
> 
> They have been for ages.

I assume Ingo means the patches to make relatime update atime at least 
once per day to ensure better compatibility with apps that do use or rely 
on access times.
These patches are already being included by several distros and, FWIW, 
Debian would like to see them upstream as well because we feel .

They were last submitted by Matthew Garrett:
http://lkml.org/lkml/2008/11/27/234
http://lkml.org/lkml/2008/11/27/235

Loads of people seem to want this, but even though it's been submitted at 
least twice and discussed even more often, it never gets anywhere.

Cheers,
FJP
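
As a rough illustration of the rule being discussed (not the submitted
patches themselves), the decision can be sketched in a few lines of C.
Plain relatime updates atime only when the old atime is not newer than
mtime or ctime; the once-per-day variant adds an age check. The struct,
field and function names below are hypothetical and exist only for this
sketch, and the 24-hour constant simply mirrors the interval named in
this thread.

```c
#include <time.h>

/* Hypothetical stand-in for an inode's three timestamps, in seconds. */
struct inode_times {
    time_t atime_sec;   /* last access */
    time_t mtime_sec;   /* last data modification */
    time_t ctime_sec;   /* last status change */
};

/* The once-per-day interval discussed in the thread. */
#define RELATIME_MAX_AGE_SEC (24 * 60 * 60)

/*
 * Return 1 if a read at time `now` should update atime, 0 otherwise.
 * Assumption: this models the behaviour described in the thread, it is
 * not copied from any kernel source.
 */
int relatime_needs_update(const struct inode_times *t, time_t now)
{
    /* Classic relatime: the file changed since it was last read. */
    if (t->mtime_sec >= t->atime_sec)
        return 1;
    if (t->ctime_sec >= t->atime_sec)
        return 1;

    /* Once-per-day safeguard: the recorded atime is at least a day old. */
    if (now - t->atime_sec >= RELATIME_MAX_AGE_SEC)
        return 1;

    return 0;
}
```

With this rule a file that is read constantly but never modified still
gets its atime refreshed about once a day, which is what tools that
look for "not accessed in N days" depend on.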

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-26 14:47                                     ` Hugh Dickins
@ 2009-03-26 15:46                                       ` Jens Axboe
  2009-03-26 18:21                                         ` Hugh Dickins
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-26 15:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, Mar 26 2009, Hugh Dickins wrote:
> On Thu, 26 Mar 2009, Jens Axboe wrote:
> > On Wed, Mar 25 2009, Hugh Dickins wrote:
> > > 
> > > Tangential question, but am I right in thinking that BIO_RW_BARRIER
> > > similarly bars across all partitions, whereas its WRITE_BARRIER and
> > > DISCARD_BARRIER users would actually prefer it to apply to just one?
> > 
> > All the barriers refer to just that range which the barrier itself
> > references.
> 
> Ah, thank you: then I had a fundamental misunderstanding of them,
> and need to go away and work that out some more.
> 
> Though I didn't read it before asking, doesn't the I/O Barriers section
> of Documentation/block/biodoc.txt give a very different impression?

I'm sensing a miscommunication here... The ordering constraint is across
devices, at least that is how it is implemented. For file system
barriers (like BIO_RW_BARRIER), it could be per-partition instead. Doing
so would involve some changes at the block layer side, not necessarily
trivial. So I think you were asking about ordering, I was answering
about the write guarantee :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 15:32                                       ` relatime: update once per day patches (was: ext3 IO latency measurements) Frans Pop
@ 2009-03-26 15:47                                         ` Andrew Morton
  2009-03-26 16:14                                           ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-03-26 15:47 UTC (permalink / raw)
  To: Frans Pop
  Cc: mingo, tytso, jack, torvalds, alan, arjan, a.p.zijlstra, npiggin,
	jens.axboe, drees76, jesper, linux-kernel, oleg, roland

On Thu, 26 Mar 2009 16:32:38 +0100 Frans Pop <elendil@planet.nl> wrote:

> Andrew Morton wrote:
> > On Thu, 26 Mar 2009 15:03:12 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> >> Why aren't the relatime patches upstream?
> > 
> > They have been for ages.
> 
> I assume Ingo means the patches to make relatime update atime at least 
> once per day to ensure better compatibility with apps that do use or rely 
> on access times.
> These patches are already being included by several distros and, FWIW, 
> Debian would like to see them upstream as well because we feel .
> 
> They were last submitted by Matthew Garrett:
> http://lkml.org/lkml/2008/11/27/234
> http://lkml.org/lkml/2008/11/27/235
> 
> Loads of people seem to want this, but even though it's been submitted at 
> least twice and discussed even more often, it never gets anywhere.
> 

Hard-wiring a 24-hour interval into the core VFS for all mounted
filesystems is dumb.

I (and others) pointed out that it would be better to implement this as
a mount option.  That suggestion was met with varying sillinesses and
that is where things stand.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 15:47                                         ` Andrew Morton
@ 2009-03-26 16:14                                           ` Linus Torvalds
  2009-03-26 16:24                                             ` Andrew Morton
                                                               ` (2 more replies)
  0 siblings, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frans Pop, mingo, tytso, jack, alan, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland



On Thu, 26 Mar 2009, Andrew Morton wrote:
> 
> Hard-wiring a 24-hour interval into the core VFS for all mounted
> filesystems is dumb.

Umm. 

I generally agree with the "leave policy to user space" people, but this 
is an area where (a) user space has shown itself to not get it right (ie 
people don't do even the existing relatime because distros don't) and (b) 
what's the alternative?

> I (and others) pointed out that it would be better to implement this as
> a mount option.  That suggestion was met with varying sillinesses and
> that is where things stand.

I'd suggest first just doing the 24 hour thing, and then, IF user space 
actually ever gets its act together, and people care, and they _ask_ for a 
mount option, that's when it's worth doing.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 14:47                                     ` ext3 IO latency measurements (was: Linux 2.6.29) Theodore Tso
@ 2009-03-26 16:20                                       ` Linus Torvalds
  2009-03-26 17:06                                         ` Andreas Schwab
                                                           ` (3 more replies)
  0 siblings, 4 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26 16:20 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath



On Thu, 26 Mar 2009, Theodore Tso wrote:
>
> Most distributions are putting relatime into /etc/fstab by
> default, but we haven't changed the mount option.

I don't think this is true. Fedora certainly does not. Not in F10, not in 
F11. 

And quite frankly, even if you then _manually_ put 'relatime' in 
/etc/fstab, the default Fedora install will totally ignore it. Why? 
Because it mounts the root partition while using initrd, and totally 
ignores /etc/fstab.

In other words, not only do distributions not do it, but you can't even do 
it by hand afterwards the sane way in the most common distro!

There really is reason for the kernel to just say "user space has sh*t for 
brains, and we'd better change the default - and if some distro really 
_thinks_ about it, and decides that they really want old-fashioned atime, 
let them do that".

Because right now, I do not believe for a moment that any distro that 
defaults to "atime" has spent lots of effort thinking about it. Quite the 
reverse. They probably default to "atime" because they spent no time AT 
ALL thinking about it.

> It wouldn't be hard to add an "atime" option to turn on atime updates, 
> and make either "noatime" or "relatime" the default.  This is a simple 
> patch to fs/namespace.c

Yes. I think we have to.

> No argument here.  I use noatime, myself.  It actually saves a lot
> more than relatime, and unless you are using mutt with local Maildir
> delivery, relatime isn't really that helpful, and the benefit of
> noatime is roughly double that of relatime vs normal atime update, in
> my measurements:
> 
> http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-and-noatimerelatime/

I do agree that "noatime" is better, but with "relatime" you at least are 
likely to not break anything. A program has to be _really_ odd to care 
about the "relatime" vs "atime" behavior.

		Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 16:14                                           ` Linus Torvalds
@ 2009-03-26 16:24                                             ` Andrew Morton
  2009-03-26 17:12                                               ` Frans Pop
  2009-03-26 16:30                                             ` Theodore Tso
  2009-03-26 17:32                                             ` [PATCH] Allow relatime to update atime once a day Matthew Garrett
  2 siblings, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-03-26 16:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frans Pop, mingo, tytso, jack, alan, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland

On Thu, 26 Mar 2009 09:14:28 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 26 Mar 2009, Andrew Morton wrote:
> > 
> > Hard-wiring a 24-hour interval into the core VFS for all mounted
> > filesystems is dumb.
> 
> Umm. 
> 
> I generally agree with the "leave policy to user space" people, but this 
> is an area where (a) user space has shown itself to not get it right (ie 
> people don't do even the existing relatime because distros don't) and (b) 
> what's the alternative?
> 
> > I (and others) pointed out that it would be better to implement this as
> > a mount option.  That suggestion was met with varying sillinesses and
> > that is where things stand.
> 
> I'd suggest first just doing the 24 hour thing, and then, IF user space 
> actually ever gets its act together, and people care, and they _ask_ for a 
> mount option, that's when it's worth doing.
> 

We wouldn't normally just enable the new feature by default because it
changes kernel behaviour.  Userspace needs to be changed in some manner to
opt-in.  One way it's `mount -o remount', the other way it's a poke in
/proc.




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-26  1:34                                 ` Linus Torvalds
  2009-03-26  2:59                                   ` Theodore Tso
@ 2009-03-26 16:24                                   ` Jan Kara
  1 sibling, 0 replies; 664+ messages in thread
From: Jan Kara @ 2009-03-26 16:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Andrew Morton, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed 25-03-09 18:34:32, Linus Torvalds wrote:
> On Thu, 26 Mar 2009, Jan Kara wrote:
> >
> >  1) We have to writeout blocks full of zeros on allocation so that we don't
> > expose unallocated data => slight slowdown
> 
> Why? 
> 
> This is in _no_ way different from a regular "write()" system call. And 
> there, we just attach the buffers to the page. If something crashes before 
> the page actually gets written out, then we'll have hopefully never 
> written out the metadata (that's what "data=ordered" means).
  Sorry, I wasn't exact enough. We'll attach buffers to the running
transaction and they'll get written out at the transaction commit which is
usually earlier than when the writepage() is called and then later
writepage() will write the data again (this is a consequence of the fact
that JBD commit code just writes buffers without calling
clear_page_dirty_for_io())...
  At least ext4 has this fixed because JBD2 already writes out ordered data
via writepages().

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 16:14                                           ` Linus Torvalds
  2009-03-26 16:24                                             ` Andrew Morton
@ 2009-03-26 16:30                                             ` Theodore Tso
  2009-03-26 16:40                                               ` Jose Celestino
                                                                 ` (2 more replies)
  2009-03-26 17:32                                             ` [PATCH] Allow relatime to update atime once a day Matthew Garrett
  2 siblings, 3 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-26 16:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Frans Pop, mingo, jack, alan, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland

On Thu, Mar 26, 2009 at 09:14:28AM -0700, Linus Torvalds wrote:
> I generally agree with the "leave policy to user space" people, but this 
> is an area where (a) user space has shown itself to not get it right (ie 
> people don't do even the existing relatime because distros don't) and (b) 
> what's the alternative?

I thought at least some distros were adding relatime by default; I
could be wrong, but I thought Ubuntu was doing this.  Personally, I
actually think that if we're going to give up on POSIX, I'll go all
the way to noatime since it helps even more.

I've always thought the right approach would be to have an "atime
dirty" flag, and update atime, but never flush it out to disk unless
(a) we're about to unmount the disk, or (b) we need to update some
other inode in the same inode table block, or (c) we have memory
pressure and we're trying to evict the inode from the inode cache.
That way we get full POSIX compliance, without taking the I/O hit of
atime updates.  The atime updates get lost if we crash, but that's
allowed by POSIX, and most people don't care about losing atime
updates after a crash.

Since it's fully backwards (and POSIX) compatible, there would be no
question about enabling it by default.

						- Ted
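
The "atime dirty" idea above can be sketched roughly as follows. This
is only an illustration of the three flush conditions (a), (b) and (c)
as listed; every name in it (lazy_inode, flush_reason, the two
functions) is hypothetical and does not correspond to existing kernel
code.

```c
#include <stdbool.h>
#include <time.h>

/* Why the caller is asking whether the inode must be written out. */
enum flush_reason {
    FLUSH_PERIODIC,      /* ordinary background writeback */
    FLUSH_UNMOUNT,       /* (a) the filesystem is being unmounted */
    FLUSH_SHARED_BLOCK,  /* (b) another inode in the same block is written */
    FLUSH_EVICT          /* (c) memory pressure evicts the inode */
};

/* Hypothetical in-memory inode state, reduced to what the sketch needs. */
struct lazy_inode {
    time_t atime_sec;
    bool atime_dirty;    /* only the access time changed */
    bool other_dirty;    /* something else changed; must be written anyway */
};

/* A read updates atime in memory only and marks it lazily dirty. */
void lazy_touch_atime(struct lazy_inode *inode, time_t now)
{
    inode->atime_sec = now;
    inode->atime_dirty = true;
}

/* Decide whether this inode needs to go to disk for the given reason. */
bool lazy_should_write(const struct lazy_inode *inode, enum flush_reason why)
{
    if (inode->other_dirty)
        return true;             /* real changes are never deferred */
    if (!inode->atime_dirty)
        return false;            /* nothing to write at all */

    /* atime-only changes wait for one of the three cases above. */
    return why == FLUSH_UNMOUNT ||
           why == FLUSH_SHARED_BLOCK ||
           why == FLUSH_EVICT;
}
```

The point of the design is that periodic writeback never pays for an
atime-only change; the cost is that such updates are lost on a crash,
as the message already notes.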

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 16:30                                             ` Theodore Tso
@ 2009-03-26 16:40                                               ` Jose Celestino
  2009-03-26 17:14                                                 ` Frans Pop
  2009-03-26 16:53                                               ` Frans Pop
  2009-03-26 16:53                                               ` Linus Torvalds
  2 siblings, 1 reply; 664+ messages in thread
From: Jose Celestino @ 2009-03-26 16:40 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Andrew Morton, Frans Pop, mingo,
	jack, alan, arjan, a.p.zijlstra, npiggin, jens.axboe, drees76,
	jesper, linux-kernel, oleg, roland

Words by Theodore Tso [Thu, Mar 26, 2009 at 12:30:26PM -0400]:
> On Thu, Mar 26, 2009 at 09:14:28AM -0700, Linus Torvalds wrote:
> > I generally agree with the "leave policy to user space" people, but this 
> > is an area where (a) user space has shown itself to not get it right (ie 
> > people don't do even the existing relatime because distros don't) and (b) 
> > what's the alternative?
> 
> I thought at least some distros were adding relatime by default; I
> could be wrong, but I thought Ubuntu was doing this.

Yes.

-- 
Jose Celestino | http://japc.uncovering.org/files/japc-pgpkey.asc
----------------------------------------------------------------
"One man’s theology is another man’s belly laugh." -- Robert A. Heinlein

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 16:30                                             ` Theodore Tso
  2009-03-26 16:40                                               ` Jose Celestino
@ 2009-03-26 16:53                                               ` Frans Pop
  2009-03-26 16:53                                               ` Linus Torvalds
  2 siblings, 0 replies; 664+ messages in thread
From: Frans Pop @ 2009-03-26 16:53 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Andrew Morton, mingo, jack, alan,
	arjan, a.p.zijlstra, npiggin, jens.axboe, drees76, jesper,
	linux-kernel, oleg, roland

On Thursday 26 March 2009, Theodore Tso wrote:
> I thought at least some distros were adding relatime by default; I
> could be wrong, but I thought Ubuntu was doing this.  Personally, I
> actually think that if we're going to give up on POSIX, I'll go all
> the way to noatime since it helps even more.

They indeed do have relatime by default, but *only because* they have the 
additional patch with the 24 hour limit in their kernel [1]. The same is 
true for Fedora IIUC.

Debian would like to activate it by default as well for new installations, 
but has so far been blocked from doing that because our kernel team has a 
policy of not including patches that are not upstream (or at least, in 
the process of being included upstream). And the Debian Installer team 
has so far felt that it would be irresponsible to activate it by 
default without this safeguard.

Cheers,
FJP

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199427
linux (2.6.24-12.18) hardy; urgency=low
[...]
  * build/configs: Enable relatime config option for all flavors

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 16:30                                             ` Theodore Tso
  2009-03-26 16:40                                               ` Jose Celestino
  2009-03-26 16:53                                               ` Frans Pop
@ 2009-03-26 16:53                                               ` Linus Torvalds
  2 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26 16:53 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, Frans Pop, mingo, jack, alan, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland



On Thu, 26 Mar 2009, Theodore Tso wrote:
> 
> I've always thought the right approach would be to have an "atime
> dirty" flag, and update atime, but never flush it out to disk unless
> (a) we're about to unmount the disk, or (b) we need to update some
> other inode in the same inode table block, or (c) we have memory
> pressure and we're trying to evict the inode from the inode cache.

I tried to do that a few years ago (ok, probably more than a few by now).

It was surprisingly hard.

Some of it is absolutely trivial: we already have multiple "dirty" flags 
for the inode (I_DIRTY_SYNC vs I_DIRTY_DATASYNC vs I_DIRTY_PAGES). Adding 
an I_DIRTY_ATIME bit for unimportant data was trivial.

But at least back then, "sync_inode()" (or whatever) was called without 
the reason for doing the sync, so it was really hard to decide whether to 
write things out or not.

That may actually have changed these days. We now have that 
"writeback_control" thing that we pass around for all the IO.

Heh. I just looked back in the history. That writeback_control thing was 
added back in 2002, so it's a _really_ long time since I tried to do that 
whole atime thing.

Maybe it's really easy these days.

				Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 16:20                                       ` Linus Torvalds
@ 2009-03-26 17:06                                         ` Andreas Schwab
  2009-03-26 17:07                                         ` Theodore Tso
                                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 664+ messages in thread
From: Andreas Schwab @ 2009-03-26 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Linus Torvalds <torvalds@linux-foundation.org> writes:

> And quite frankly, even if you then _manually_ put 'relatime' in 
> /etc/fstab, the default Fedora install will totally ignore it. Why? 
> Because it mounts the root partition while using initrd, and totally 
> ignores /etc/fstab.

That works here in openSUSE 11.1.  The initrd remounts the rootfs with
any options it finds in /etc/fstab.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 16:20                                       ` Linus Torvalds
  2009-03-26 17:06                                         ` Andreas Schwab
@ 2009-03-26 17:07                                         ` Theodore Tso
  2009-03-26 17:16                                           ` Linus Torvalds
  2009-03-26 17:29                                         ` ext3 IO latency measurements (was: Linux 2.6.29) Frans Pop
  2009-03-26 17:32                                         ` Bill Nottingham
  3 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-26 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Thu, Mar 26, 2009 at 09:20:14AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 26 Mar 2009, Theodore Tso wrote:
> >
> > Most distributions are putting relatime into /etc/fstab by
> > default, but we haven't changed the mount option.
> 
> I don't think this is true. Fedora certainly does not. Not in F10, not in 
> F11. 

Ubuntu does.   I thought Fedora had, but I stand corrected.

> And quite frankly, even if you then _manually_ put 'relatime' in 
> /etc/fstab, the default Fedora install will totally ignore it. Why? 
> Because it mounts the root partition while using initrd, and totally 
> ignores /etc/fstab.

You can, actually, but it requires hacking /boot/grub/menu.lst.  The
boot command option "rootflags=noatime" should do it, if their initrd
scripts are at all sane (and they honor rootfstype, so they probably
do also honor rootflags).

The question is whether we can make Fedora 11 and OpenSUSE do the
right thing now that this has become a highly visible discussion.  I'm
actually fairly optimistic on this front.  (Maybe some distro folks
will care to chime in on whether upcoming releases of F11 and OpenSuSE
can be changed to DTRT?)

Actually, given where F11 is on its release schedule, I suspect it
would be *easier* for them to make a change to default boot options in
grub's menu.lst than it would be to backport a kernel patch, since they
will be releasing their beta release within the week, and their final
development freeze is in less than two weeks.

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 16:24                                             ` Andrew Morton
@ 2009-03-26 17:12                                               ` Frans Pop
  2009-03-26 17:48                                                 ` Andrew Morton
  0 siblings, 1 reply; 664+ messages in thread
From: Frans Pop @ 2009-03-26 17:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, mingo, tytso, jack, alan, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland

On Thursday 26 March 2009, Andrew Morton wrote:
> On Thu, 26 Mar 2009 09:14:28 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > I generally agree with the "leave policy to user space" people, but
> > this is an area where (a) user space has shown itself to not get it
> > right (ie people don't do even the existing relatime because distros
> > don't) and (b) what's the alternative?
> >
> > > I (and others) pointed out that it would be better to implement
> > > this as a mount option.  That suggestion was met with varying
> > > sillinesses and that is where things stand.
> >
> > I'd suggest first just doing the 24 hour thing, and then, IF user
> > space actually ever gets its act together, and people care, and they
> > _ask_ for a mount option, that's when it's worth doing.
>
> We wouldn't normally just enable the new feature by default because it
> changes kernel behaviour.  Userspace needs to be changed in some manner
> to opt-in.  One way it's `mount -o remount', the other way it's a poke
> in /proc.

What change are you talking about here exactly? The "change relatime to 
have a 24 hour safeguard" of Matthew's first patch or the "enable 
relatime by default" options in the second patch?

For the first I don't think it's that big a deal as it is a change that 
makes the behavior of relatime safer and not riskier. Also, it's 
something people have argued should have been part of the initial 
functionality of relatime (it was part of the discussion back then), and 
finally for a lot of users it's already current functionality as major 
distros already do include the patch.

For the second, I can see your point and can understand reservations to 
make enabling relatime a kernel config option.

Speaking exclusively for myself, I would be happy enough if only the first 
of Matthew's patches would get accepted.

Cheers,
FJP

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 16:40                                               ` Jose Celestino
@ 2009-03-26 17:14                                                 ` Frans Pop
  0 siblings, 0 replies; 664+ messages in thread
From: Frans Pop @ 2009-03-26 17:14 UTC (permalink / raw)
  To: Jose Celestino
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, mingo, jack, alan,
	arjan, a.p.zijlstra, npiggin, jens.axboe, drees76, jesper,
	linux-kernel, oleg, roland

On Thursday 26 March 2009, Jose Celestino wrote:
> Words by Theodore Tso [Thu, Mar 26, 2009 at 12:30:26PM -0400]:
> > On Thu, Mar 26, 2009 at 09:14:28AM -0700, Linus Torvalds wrote:
> > > I generally agree with the "leave policy to user space" people, but
> > > this is an area where (a) user space has shown itself to not get it
> > > right (ie people don't do even the existing relatime because
> > > distros don't) and (b) what's the alternative?
> >
> > I thought at least some distros were adding relatime by default; I
> > could be wrong, but I thought Ubuntu was doing this.
>
> Yes.

No, not the "pure" relatime that's in the upstream kernel. And that's the 
whole point here. See my direct reply to Ted.

Cheers,
FJP

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 17:07                                         ` Theodore Tso
@ 2009-03-26 17:16                                           ` Linus Torvalds
  2009-03-26 17:49                                             ` [PATCH 1/2] Add a strictatime mount option Matthew Garrett
  2009-03-26 18:59                                             ` ext3 IO latency measurements (was: Linux 2.6.29) Alan Cox
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26 17:16 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath



On Thu, 26 Mar 2009, Theodore Tso wrote:
> 
> You can, actually, but it requires hacking /boot/grub/menu.lst.  The
> boot command option "rootflags=noatime" should do it, if their initrd
> scripts are at all sane (and they honor rootfstype, so they probably
> do also honor rootflags).

Not when I tried it. It just causes the initrd to be mounted noatime, and 
then the real root filesystem gets mounted atime again.

Maybe I screwed up. But I don't think so.

> The question is whether we can make Fedora 11 and OpenSUSE do the
> right thing now that this has become a highly visible discussion.  I'm
> actually fairly optimistic on this front.  (Maybe some distro folks
> will care to chime in on whether upcoming releases of F11 and OpenSuSE
> can be changed to DTRT?)

And what's the argument for not doing it in the kernel?

The fact is, "atime" by default is just wrong.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 16:20                                       ` Linus Torvalds
  2009-03-26 17:06                                         ` Andreas Schwab
  2009-03-26 17:07                                         ` Theodore Tso
@ 2009-03-26 17:29                                         ` Frans Pop
  2009-03-26 17:32                                         ` Bill Nottingham
  3 siblings, 0 replies; 664+ messages in thread
From: Frans Pop @ 2009-03-26 17:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: tytso, mingo, jack, akpm, alan, arjan, a.p.zijlstra, npiggin,
	jens.axboe, drees76, jesper, linux-kernel, oleg, roland

Linus Torvalds wrote:
> And quite frankly, even if you then _manually_ put 'relatime' in
> /etc/fstab, the default Fedora install will totally ignore it. Why?
> Because it mounts the root partition while using initrd, and totally
> ignores /etc/fstab.

That's a difference between Fedora's initrd and Debian/Ubuntu's 
initramfs-tools then. We do respect the mount options in fstab for the 
root partition when root is mounted from the initrd:

$ cat /etc/fstab | grep " / "
/dev/mapper/main-root /      ext3    relatime,errors=remount-ro 0       1

$ mount | grep " / "
/dev/mapper/main-root on / type ext3 (rw,relatime,errors=remount-ro)

$ cat /proc/cmdline
root=/dev/mapper/main-root ro vga=791 quiet
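
The same check can be done programmatically. A minimal sketch, assuming
the caller has already extracted the comma-separated option list (the
fourth field of /etc/fstab, or the text inside the parentheses printed
by mount); has_mount_option() is a made-up helper name, not a libc
function:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the comma-separated option list `opts` contains `name`
 * as a whole option, either bare ("relatime") or with a value
 * ("errors=remount-ro" matches the name "errors").
 */
bool has_mount_option(const char *opts, const char *name)
{
    size_t name_len;

    assert(opts && name);
    name_len = strlen(name);
    if (name_len == 0)
        return false;

    while (*opts) {
        /* Length of the current option, up to the next comma or the end. */
        size_t tok_len = strcspn(opts, ",");

        if (tok_len >= name_len &&
            strncmp(opts, name, name_len) == 0 &&
            (tok_len == name_len || opts[name_len] == '='))
            return true;

        opts += tok_len;
        if (*opts == ',')
            opts++;
    }
    return false;
}
```

Matching whole options matters here: a plain substring search for
"atime" would also hit "relatime" and "noatime".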

Cheers,
FJP

^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH] Allow relatime to update atime once a day
  2009-03-26 16:14                                           ` Linus Torvalds
  2009-03-26 16:24                                             ` Andrew Morton
  2009-03-26 16:30                                             ` Theodore Tso
@ 2009-03-26 17:32                                             ` Matthew Garrett
  2009-03-26 17:56                                               ` Alexey Dobriyan
  2009-03-26 18:55                                               ` Alan Cox
  2 siblings, 2 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26 17:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Frans Pop, mingo, tytso, jack, alan, arjan,
	a.p.zijlstra, npiggin, jens.axboe, drees76, jesper, linux-kernel,
	oleg, roland, willy, vaurora

Allow atime to be updated once per day even with relatime. This lets
utilities like tmpreaper (which delete files based on last access time)
continue working, making relatime a plausible default for distributions.
    
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Valerie Aurora Henson <vaurora@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>

---

diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..057c92b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1179,6 +1179,40 @@ sector_t bmap(struct inode * inode, sector_t block)
 }
 EXPORT_SYMBOL(bmap);
 
+/*
+ * With relative atime, only update atime if the previous atime is
+ * earlier than either the ctime or mtime or if at least a day has
+ * passed since the last atime update.
+ */
+static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
+			     struct timespec now)
+{
+
+	if (!(mnt->mnt_flags & MNT_RELATIME))
+		return 1;
+	/*
+	 * Is mtime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
+		return 1;
+	/*
+	 * Is ctime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
+		return 1;
+
+	/*
+	 * Is the previous atime value older than a day? If yes,
+	 * update atime:
+	 */
+	if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24*60*60)
+		return 1;
+	/*
+	 * Good, we can skip the atime update:
+	 */
+	return 0;
+}
+
 /**
  *	touch_atime	-	update the access time
  *	@mnt: mount the inode is accessed on
@@ -1206,17 +1240,12 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
 		goto out;
 	if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
 		goto out;
-	if (mnt->mnt_flags & MNT_RELATIME) {
-		/*
-		 * With relative atime, only update atime if the previous
-		 * atime is earlier than either the ctime or mtime.
-		 */
-		if (timespec_compare(&inode->i_mtime, &inode->i_atime) < 0 &&
-		    timespec_compare(&inode->i_ctime, &inode->i_atime) < 0)
-			goto out;
-	}
 
 	now = current_fs_time(inode->i_sb);
+
+	if (!relatime_need_update(mnt, inode, now))
+		goto out;
+
 	if (timespec_equal(&inode->i_atime, &now))
 		goto out;

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 16:20                                       ` Linus Torvalds
                                                           ` (2 preceding siblings ...)
  2009-03-26 17:29                                         ` ext3 IO latency measurements (was: Linux 2.6.29) Frans Pop
@ 2009-03-26 17:32                                         ` Bill Nottingham
  2009-03-26 17:41                                           ` Linus Torvalds
  2009-03-26 18:54                                           ` Alan Cox
  3 siblings, 2 replies; 664+ messages in thread
From: Bill Nottingham @ 2009-03-26 17:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Linus Torvalds (torvalds@linux-foundation.org) said: 
> And quite frankly, even if you then _manually_ put 'relatime' in 
> /etc/fstab, the default Fedora install will totally ignore it. Why? 
> Because it mounts the root partition while using initrd, and totally 
> ignores /etc/fstab.

It should honor /etc/fstab changes, if the initramfs is rebuilt
after the change is made. If it doesn't, that's a bug.

Bill

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 17:32                                         ` Bill Nottingham
@ 2009-03-26 17:41                                           ` Linus Torvalds
  2009-03-26 18:23                                             ` Bill Nottingham
  2009-03-26 18:54                                           ` Alan Cox
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26 17:41 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath



On Thu, 26 Mar 2009, Bill Nottingham wrote:

> Linus Torvalds (torvalds@linux-foundation.org) said: 
> > And quite frankly, even if you then _manually_ put 'relatime' in 
> > /etc/fstab, the default Fedora install will totally ignore it. Why? 
> > Because it mounts the root partition while using initrd, and totally 
> > ignores /etc/fstab.
> 
> It should honor /etc/fstab changes, if the initramfs is rebuilt
> after the change is made. If it doesn't, that's a bug.

Why the hell should I rebuild initramfs? 

Anyway, I fixed it. I don't use initramfs any more, after all the idiocies 
it has done. I had to make everything primary partitions in order to do 
that, but hey, that solved a lot of other problems too, so that was no 
loss.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 17:12                                               ` Frans Pop
@ 2009-03-26 17:48                                                 ` Andrew Morton
  2009-03-26 18:49                                                   ` Matthew Garrett
  0 siblings, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-03-26 17:48 UTC (permalink / raw)
  To: Frans Pop
  Cc: Linus Torvalds, mingo, tytso, jack, alan, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland

On Thu, 26 Mar 2009 18:12:04 +0100 Frans Pop <elendil@planet.nl> wrote:

> On Thursday 26 March 2009, Andrew Morton wrote:
> > On Thu, 26 Mar 2009 09:14:28 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > I generally agree with the "leave policy to user space" people, but
> > > this is an area where (a) user space has shown itself to not get it
> > > right (ie people don't do even the existing relatime because distros
> > > don't) and (b) what's the alternative?
> > >
> > > > I (and others) pointed out that it would be better to implement
> > > > this as a mount option.  That suggestion was met with varying
> > > > sillinesses and that is where things stand.
> > >
> > > I'd suggest first just doing the 24 hour thing, and then, IF user
> > > space actually ever gets its act together, and people care, and they
> > > _ask_ for a mount option, that's when it's worth doing.
> >
> > We wouldn't normally just enable the new feature by default because it
> > changes kernel behaviour.  Userspace needs to be changed in some manner
> > to opt-in.  One way it's `mount -o remount', the other way it's a poke
> > in /proc.
> 
> What change are you talking about here exactly? The "change relatime to 
> have a 24 hour safeguard" of Matthew's first patch or the "enable 
> relatime by default" options in the second patch?
> 
> For the first I don't think it's that big a deal as it is a change that 
> makes the behavior of relatime safer and not riskier. Also, it's 
> something people have argued should have been part of the initial 
> functionality of relatime (it was part of the discussion back then), and 
> finally for a lot of users it's already current functionality as major 
> distros already do include the patch.
> 
> For the second, I can see your point and can understand reservations to 
> make enabling relatime a kernel config option.
> 
> Speaking exclusively for myself, I would be happy enough if only the first 
> of Matthew's patches would get accepted.
> 

Oh, the feature itself is desirable.  But the interface isn't.

- It's a magic number.  Maybe someone runs tmpwatch twice per day, or
  weekly, or...

- That's fixable by making "24" tunable, but it's still a global
  thing.  Better to make it per-fs.

- mount(8) is the standard way of tuning fs behaviour.  There's no
  need to deviate from that here.

Note that none of this involves the default setting.  With a per-mount
tunable we can still make the default for each fs be "on, 24 hours"
if we so decide.
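
To make the per-mount idea concrete, here is a userspace sketch of the
relatime decision with the interval stored per mount instead of being
hard-coded. The struct and field names are hypothetical (the real
struct vfsmount has no such field), and times are plain seconds rather
than struct timespec:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-mount state. */
struct mount_model {
    bool relatime;        /* relatime behaviour enabled on this mount */
    long atime_interval;  /* seconds after which atime is refreshed anyway */
};

/* Inode timestamps, in seconds. */
struct inode_times {
    long atime, mtime, ctime;
};

/* Return true if an access at time `now` should update atime. */
bool model_atime_needs_update(const struct mount_model *mnt,
                              const struct inode_times *ino, long now)
{
    assert(mnt && ino);
    if (!mnt->relatime)
        return true;   /* strict atime: every access updates */
    if (ino->mtime >= ino->atime || ino->ctime >= ino->atime)
        return true;   /* file changed since the recorded access */
    /* Otherwise only refresh once the per-mount interval has passed. */
    return now - ino->atime >= mnt->atime_interval;
}
```

With this shape, "on, 24 hours" is just the default value of
atime_interval, and a mount that runs its cleaner weekly could pick a
different number without affecting any other filesystem.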


^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 1/2] Add a strictatime mount option
  2009-03-26 17:16                                           ` Linus Torvalds
@ 2009-03-26 17:49                                             ` Matthew Garrett
  2009-03-26 17:53                                               ` [PATCH 2/2] Make relatime default Matthew Garrett
  2009-03-26 18:52                                               ` [PATCH 1/2] Add a strictatime mount option Alan Cox
  2009-03-26 18:59                                             ` ext3 IO latency measurements (was: Linux 2.6.29) Alan Cox
  1 sibling, 2 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26 17:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Add support for explicitly requesting full atime updates. This makes it
possible for kernels to default to relatime but still allow userspace to
override it.

Signed-off-by: Matthew Garrett <mjg@redhat.com>
---
 fs/namespace.c        |    6 +++++-
 include/linux/fs.h    |    1 +
 include/linux/mount.h |    1 +
 3 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 06f8e63..d0659ec 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -780,6 +780,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_STRICTATIME, ",strictatime" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
@@ -1932,11 +1933,14 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+	if (flags & MS_STRICTATIME)
+		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
+		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
+		   MS_STRICTATIME);
 
 	/* ... and get the mountpoint */
 	retval = kern_path(dir_name, LOOKUP_FOLLOW, &path);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92734c0..5bc81c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -141,6 +141,7 @@ struct inodes_stat_t {
 #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
+#define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..51f55f9 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -27,6 +27,7 @@ struct mnt_namespace;
 #define MNT_NODIRATIME	0x10
 #define MNT_RELATIME	0x20
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
+#define MNT_STRICTATIME 0x80
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_IMBALANCED_WRITE_COUNT	0x200 /* just for debugging */

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply related	[flat|nested] 664+ messages in thread

* [PATCH 2/2] Make relatime default
  2009-03-26 17:49                                             ` [PATCH 1/2] Add a strictatime mount option Matthew Garrett
@ 2009-03-26 17:53                                               ` Matthew Garrett
  2009-03-26 18:48                                                 ` Alan Cox
  2009-03-26 18:52                                               ` [PATCH 1/2] Add a strictatime mount option Alan Cox
  1 sibling, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Change the default behaviour of the kernel to use relatime for all
filesystems. This can be overridden with the "strictatime" mount
option.
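
For illustration only, the combined effect of the two patches on the
atime-related flags can be modelled in userspace as below. The MODEL_*
names are invented for this sketch so that it does not clash with real
headers; the bit values of the relatime and strictatime flags follow
the hunks in patch 1/2, and the noatime values are assumed from the
kernel headers of the time.

```c
/* Mount-call flags (MS_* in the kernel). */
#define MODEL_MS_NOATIME      (1UL << 10)
#define MODEL_MS_RELATIME     (1UL << 21)
#define MODEL_MS_STRICTATIME  (1UL << 24)

/* Per-mountpoint flags (MNT_* in the kernel). */
#define MODEL_MNT_NOATIME     0x08u
#define MODEL_MNT_RELATIME    0x20u

/* Translate mount-call flags into per-mountpoint atime flags. */
unsigned model_mount_flags(unsigned long flags)
{
    unsigned mnt_flags = 0;

    /* Patch 2/2: relatime is set before any user flag is looked at. */
    mnt_flags |= MODEL_MNT_RELATIME;

    if (flags & MODEL_MS_NOATIME)
        mnt_flags |= MODEL_MNT_NOATIME;

    /* Patch 1/2: strictatime clears both relatime and noatime. */
    if (flags & MODEL_MS_STRICTATIME)
        mnt_flags &= ~(MODEL_MNT_RELATIME | MODEL_MNT_NOATIME);

    return mnt_flags;
}
```

Two things fall out of the model: passing the relatime flag explicitly
no longer changes anything, and strictatime wins if it is combined with
noatime.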

Signed-off-by: Matthew Garrett <mjg@redhat.com>
---
 fs/namespace.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d0659ec..f0e7530 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1920,6 +1920,9 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	if (data_page)
 		((char *)data_page)[PAGE_SIZE - 1] = 0;
 
+	/* Default to relatime */
+	mnt_flags |= MNT_RELATIME;
+
 	/* Separate the per-mountpoint flags */
 	if (flags & MS_NOSUID)
 		mnt_flags |= MNT_NOSUID;
@@ -1931,8 +1934,6 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 		mnt_flags |= MNT_NOATIME;
 	if (flags & MS_NODIRATIME)
 		mnt_flags |= MNT_NODIRATIME;
-	if (flags & MS_RELATIME)
-		mnt_flags |= MNT_RELATIME;
 	if (flags & MS_STRICTATIME)
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: [PATCH] Allow relatime to update atime once a day
  2009-03-26 17:32                                             ` [PATCH] Allow relatime to update atime once a day Matthew Garrett
@ 2009-03-26 17:56                                               ` Alexey Dobriyan
  2009-03-26 18:55                                               ` Alan Cox
  1 sibling, 0 replies; 664+ messages in thread
From: Alexey Dobriyan @ 2009-03-26 17:56 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Andrew Morton, Frans Pop, mingo, tytso, jack,
	alan, arjan, a.p.zijlstra, npiggin, jens.axboe, drees76, jesper,
	linux-kernel, oleg, roland, willy, vaurora

On Thu, Mar 26, 2009 at 05:32:14PM +0000, Matthew Garrett wrote:
> Allow atime to be updated once per day even with relatime. This lets
> utilities like tmpreaper (which delete files based on last access time)
> continue working, making relatime a plausible default for distributions.

> +/*
> + * With relative atime, only update atime if the previous atime is
> + * earlier than either the ctime or mtime or if at least a day has
> + * passed since the last atime update.
> + */
> +static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
> +			     struct timespec now)
> +{
> +
> +	if (!(mnt->mnt_flags & MNT_RELATIME))
> +		return 1;
> +	/*
> +	 * Is mtime younger than atime? If yes, update atime:
> +	 */
> +	if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
> +		return 1;
> +	/*
> +	 * Is ctime younger than atime? If yes, update atime:
> +	 */
> +	if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
> +		return 1;
> +
> +	/*
> +	 * Is the previous atime value older than a day? If yes,
> +	 * update atime:
> +	 */
> +	if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24*60*60)
> +		return 1;
> +	/*
> +	 * Good, we can skip the atime update:
> +	 */
> +	return 0;
> +}

Good example of overcommented code.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
                                                   ` (4 preceding siblings ...)
  2009-03-26 14:38                                 ` Andrew Morton
@ 2009-03-26 18:11                                 ` Jan Kara
  2009-03-26 18:35                                   ` Andrew Morton
  2009-03-26 22:39                                   ` Linus Torvalds
  2009-04-09 21:59                                 ` updated: ext3 IO latency measurements on v2.6.30-rc1 Ingo Molnar
  6 siblings, 2 replies; 664+ messages in thread
From: Jan Kara @ 2009-03-26 18:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Theodore Tso, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 5997 bytes --]

On Thu 26-03-09 10:06:30, Ingo Molnar wrote:
> 
> * Jan Kara <jack@suse.cz> wrote:
> 
> > > So tell me again how the VM can rely on the filesystem not 
> > > blocking at random points.
> >
> >   I can write a patch to make writepage() in the non-"mmapped 
> > creation" case non-blocking on journal. But I'll also have to find 
> > out whether it really helps something. But it's probably worth 
> > trying...
> 
> _all_ the problems i ever had with ext3 were 'collateral damage' 
> type of things: simple writes (sometimes even reads) getting 
> serialized on some large [but reasonable] dirtying activity 
> elsewhere - even if the system was still well within its 
> hard-dirty-limit threshold.
> 
> So it sure sounds like an area worth improving, and it's not that 
> hard to reproduce either. Take a system with enough RAM but only a 
> single disk, and do this in a kernel tree:
> 
>   sync
>   echo 3 > /proc/sys/vm/drop_caches
> 
>   while :; do
>     date
>     make mrproper      2>/dev/null >/dev/null
>     make defconfig     2>/dev/null >/dev/null
>     make -j32 bzImage  2>/dev/null >/dev/null
>   done &
  I've played with it a bit. I don't have a fast enough machine so that a
compile would feed my SATA drive fast enough (and I also have just 2 GB of
memory) but copying kernel tree there and back seemed to load it reasonably.
  I've tried a kernel with and without the attached patch, which makes
writepage not start a transaction when not needed.
 
> Plain old kernel build, no distcc and no icecream. Wait a few 
> minutes for the system to reach equilibrium. There's no tweaking 
> anywhere, kernel, distro and filesystem defaults used everywhere:
> 
>  aldebaran:/home/mingo/linux/linux> ./compile-test 
>  Thu Mar 26 10:33:03 CET 2009
>  Thu Mar 26 10:35:24 CET 2009
>  Thu Mar 26 10:36:48 CET 2009
>  Thu Mar 26 10:38:54 CET 2009
>  Thu Mar 26 10:41:22 CET 2009
>  Thu Mar 26 10:43:41 CET 2009
>  Thu Mar 26 10:46:02 CET 2009
>  Thu Mar 26 10:48:28 CET 2009
> 
> And try to use the system while this workload is going on. Use Vim 
> to edit files in this kernel tree. Use plain _cat_ - and i hit 
> delays all the time - and it's not the CPU scheduler but all IO 
> related.
  So I observed long delays when VIM was saving a file but in all cases it
was hanging in fsync() which was committing a large transaction (this was
both with and without patch) - not a big surprise. Working on the machine
seemed a bit better when the patch was applied - in the kernel with the
patch VIM at least didn't hang when just writing into the file.
  Reads are measurably better with the patch - the test with cat you
describe below took ~0.5s per file without the patch and always less than
0.02s with the patch. So it seems to help something. Can you check on your
machine whether you see some improvements? Thanks.

> I have such an ext3 based system where i can do such tests and where 
> i dont mind crashes and data corruption either, so if you send me 
> experimental patches against latest -git i can try them immediately. 
> The system has 16 CPUs, 12GB of RAM and a single disk.
> 
> Btw., i had this test going on that box while i wrote some simple 
> scripts in Vim - and it was a horrible experience. The worst wait 
> was well above one minute - Vim just hung there indefinitely. Not 
> even Ctrl-Z was possible. I captured one such wait, it was hanging 
> right here:
> 
>  aldebaran:~/linux/linux> cat /proc/3742/stack
>  [<ffffffff8034790a>] log_wait_commit+0xbd/0x110
>  [<ffffffff803430b2>] journal_stop+0x1df/0x20d
>  [<ffffffff8034421f>] journal_force_commit+0x28/0x2d
>  [<ffffffff80331c69>] ext3_force_commit+0x2b/0x2d
>  [<ffffffff80328b56>] ext3_write_inode+0x3e/0x44
>  [<ffffffff802ebb9d>] __sync_single_inode+0xc1/0x2ad
>  [<ffffffff802ebed6>] __writeback_single_inode+0x14d/0x15a
>  [<ffffffff802ebf0c>] sync_inode+0x29/0x34
>  [<ffffffff80327453>] ext3_sync_file+0xa7/0xb4
>  [<ffffffff802ef17d>] vfs_fsync+0x78/0xaf
>  [<ffffffff802ef1eb>] do_fsync+0x37/0x4d
>  [<ffffffff802ef228>] sys_fsync+0x10/0x14
>  [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
>  [<ffffffffffffffff>] 0xffffffffffffffff
> 
> It took about 120 seconds for it to recover.
> 
> And it's not just sys_fsync(). The script i wrote tests file read 
> latencies. I have created 1000 files with the same size (all copies 
> of kernel/sched.c ;-), and tested their cache-cold plain-cat 
> performance via:
> 
>   for ((i=0;i<1000;i++)); do
>     printf "file #%4d, plain reading it took: " $i
>     /usr/bin/time -f "%e seconds."  cat $i >/dev/null
>   done
> 
> I.e. plain, supposedly high-prio reads. The result is very common 
> hiccups in read latencies:
> 
> file # 579 (253560 bytes), reading it took: 0.08 seconds.
> file # 580 (253560 bytes), reading it took: 0.05 seconds.
> file # 581 (253560 bytes), reading it took: 0.01 seconds.
> file # 582 (253560 bytes), reading it took: 0.01 seconds.
> file # 583 (253560 bytes), reading it took: 4.61 seconds.
> file # 584 (253560 bytes), reading it took: 1.29 seconds.
> file # 585 (253560 bytes), reading it took: 3.01 seconds.
> file # 586 (253560 bytes), reading it took: 7.74 seconds.
> file # 587 (253560 bytes), reading it took: 3.22 seconds.
> file # 588 (253560 bytes), reading it took: 0.05 seconds.
> file # 589 (253560 bytes), reading it took: 0.36 seconds.
> file # 590 (253560 bytes), reading it took: 7.39 seconds.
> file # 591 (253560 bytes), reading it took: 7.58 seconds.
> file # 592 (253560 bytes), reading it took: 7.90 seconds.
> file # 593 (253560 bytes), reading it took: 8.78 seconds.
> file # 594 (253560 bytes), reading it took: 8.01 seconds.
> file # 595 (253560 bytes), reading it took: 7.47 seconds.
> file # 596 (253560 bytes), reading it took: 11.52 seconds.
> file # 597 (253560 bytes), reading it took: 10.33 seconds.
> file # 598 (253560 bytes), reading it took: 8.56 seconds.
> file # 599 (253560 bytes), reading it took: 7.58 seconds.
  <snip>
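
The shell loop quoted above can also be written as a small script that keeps
the timings and lists the slowest reads. This is only a sketch: the names
time_reads() and worst() are made up for illustration, and it does not drop
the page cache, so a cache-cold run like the one quoted would have to arrange
that separately.

```python
import os
import tempfile
import time


def time_reads(paths):
    """Return a list of (path, seconds) for reading each file in full."""
    results = []
    for path in paths:
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(1 << 20):  # read in 1 MiB chunks until EOF
                pass
        results.append((path, time.perf_counter() - start))
    return results


def worst(results, n=5):
    """Return the n slowest reads, slowest first."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:n]


if __name__ == "__main__":
    # Demo on throwaway files; they are page-cache hot, so timings are tiny.
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(10):
            p = os.path.join(d, str(i))
            with open(p, "wb") as f:
                f.write(b"x" * 253560)  # same size as the files in the report
            paths.append(p)
        for path, secs in worst(time_reads(paths), 3):
            print(f"file {os.path.basename(path)}: {secs:.4f} seconds")
```

Sorting by duration makes multi-second stalls like those in the listing above
stand out without scanning a thousand lines of output.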

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-ext3-Avoid-starting-a-transaction-in-writepage-when.patch --]
[-- Type: text/x-patch, Size: 2268 bytes --]

>From 1bf84d0f6162196b4c0d83e9db1ee11507a8f91f Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 26 Mar 2009 13:08:04 +0100
Subject: [PATCH] ext3: Avoid starting a transaction in writepage when not necessary

We don't have to start a transaction in writepage() when all the blocks
are properly allocated. Even in ordered mode either the data has been
written via write() and they are thus already added to transaction's list
or the data was written via mmap and then it's random in which transaction
they get written anyway.

This should help VM to pageout dirty memory without blocking on transaction
commits.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c |   19 ++++++++++++++-----
 1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index e230f7a..61bce1a 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1420,6 +1420,10 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
 	return 0;
 }
 
+static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
+{
+	return !buffer_mapped(bh);
+}
 /*
  * Note that we always start a transaction even if we're not journalling
  * data.  This is to preserve ordering: any hole instantiation within
@@ -1490,6 +1494,16 @@ static int ext3_ordered_writepage(struct page *page,
 	if (ext3_journal_current_handle())
 		goto out_fail;
 
+	if (!page_has_buffers(page)) {
+		create_empty_buffers(page, inode->i_sb->s_blocksize,
+				(1 << BH_Dirty)|(1 << BH_Uptodate));
+	} else if (!walk_page_buffers(NULL, page_buffers(page), 0, PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
+		/* Provide NULL instead of get_block so that we catch bugs if buffers weren't really mapped */
+		return block_write_full_page(page, NULL, wbc);
+	}
+	page_bufs = page_buffers(page);
+
+
 	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
 
 	if (IS_ERR(handle)) {
@@ -1497,11 +1511,6 @@ static int ext3_ordered_writepage(struct page *page,
 		goto out_fail;
 	}
 
-	if (!page_has_buffers(page)) {
-		create_empty_buffers(page, inode->i_sb->s_blocksize,
-				(1 << BH_Dirty)|(1 << BH_Uptodate));
-	}
-	page_bufs = page_buffers(page);
 	walk_page_buffers(handle, page_bufs, 0,
 			PAGE_CACHE_SIZE, NULL, bget_one);
 
-- 
1.6.0.2
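
The decision the patch adds to ext3_ordered_writepage() can be modelled in
userspace roughly as follows. This is a sketch of the control flow only, not
kernel code: the Buffer class and needs_transaction() are hypothetical names,
and a plain list stands in for the page's buffer ring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Buffer:
    mapped: bool  # True when the buffer already has a disk block assigned


def needs_transaction(page_buffers: Optional[List[Buffer]]) -> bool:
    """Model of the check the patch performs before starting a transaction.

    page_buffers is None when the page has no buffers attached yet.
    """
    if page_buffers is None:
        # The patch creates empty buffers here and falls through to
        # ext3_journal_start(), because blocks still have to be allocated.
        return True
    # Mirrors walk_page_buffers(..., buffer_unmapped): any unmapped buffer
    # means block allocation, which has to happen inside a transaction.
    return any(not b.mapped for b in page_buffers)


if __name__ == "__main__":
    print(needs_transaction([Buffer(True), Buffer(True)]))   # fully mapped page
    print(needs_transaction([Buffer(True), Buffer(False)]))  # one hole left
```

When the function returns False the patch calls block_write_full_page()
directly, so pageout of an already-allocated page does not wait on a commit.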


^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-26 15:46                                       ` Jens Axboe
@ 2009-03-26 18:21                                         ` Hugh Dickins
  2009-03-26 18:32                                           ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Hugh Dickins @ 2009-03-26 18:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, 26 Mar 2009, Jens Axboe wrote:
> On Thu, Mar 26 2009, Hugh Dickins wrote:
> > On Thu, 26 Mar 2009, Jens Axboe wrote:
> > > On Wed, Mar 25 2009, Hugh Dickins wrote:
> > > > 
> > > > Tangential question, but am I right in thinking that BIO_RW_BARRIER
> > > > similarly bars across all partitions, whereas its WRITE_BARRIER and
> > > > DISCARD_BARRIER users would actually prefer it to apply to just one?
> > > 
> > > All the barriers refer to just that range which the barrier itself
> > > references.
> > 
> > Ah, thank you: then I had a fundamental misunderstanding of them,
> > and need to go away and work that out some more.
> > 
> > Though I didn't read it before asking, doesn't the I/O Barriers section
> > of Documentation/block/biodoc.txt give a very different impression?
> 
> I'm sensing a miscommunication here... The ordering constraint is across
> devices, at least that is how it is implemented. For file system
> barriers (like BIO_RW_BARRIER), it could be per-partition instead. Doing
> so would involve some changes at the block layer side, not necessarily
> trivial. So I think you were asking about ordering, I was answering
> about the write guarantee :-)

Ah, thank you again, perhaps I did understand after all.

So, directing a barrier (WRITE_BARRIER or DISCARD_BARRIER) to a range
of sectors in one partition interposes a barrier into the queue of I/O
across (all partitions of) that whole device.

I think that's not how filesystems really want barriers to behave,
and might tend to discourage us from using barriers more freely.
But I have zero appreciation of whether it's a significant issue
worth non-trivial change - just wanted to get it out into the open.

Hugh
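
The difference between the two barrier scopes can be made concrete with a toy
model. It is a sketch under stated assumptions, not the block layer's logic:
Request and may_reorder() are invented names, and the queue is just a list in
submission order.

```python
from typing import List, NamedTuple


class Request(NamedTuple):
    partition: str
    barrier: bool = False


def may_reorder(queue: List[Request], i: int, j: int,
                per_partition: bool) -> bool:
    """Return True if requests i < j may be issued in either order."""
    for k in range(i, j + 1):
        if not queue[k].barrier:
            continue
        if not per_partition:
            # Current behaviour as described: a barrier fences the device.
            return False
        p = queue[k].partition
        if queue[i].partition == p and queue[j].partition == p:
            # Hypothetical per-partition scope: only same-partition I/O
            # is ordered around the barrier.
            return False
    return True


if __name__ == "__main__":
    q = [Request("sda1"), Request("sda2", barrier=True), Request("sda1")]
    print(may_reorder(q, 0, 2, per_partition=False))
    print(may_reorder(q, 0, 2, per_partition=True))
```

In the demo queue the two sda1 requests are separated only by a barrier aimed
at sda2, so they are ordered under whole-device scope and free to be sorted
under per-partition scope.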

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 17:41                                           ` Linus Torvalds
@ 2009-03-26 18:23                                             ` Bill Nottingham
  2009-03-26 22:24                                               ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Bill Nottingham @ 2009-03-26 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Linus Torvalds (torvalds@linux-foundation.org) said: 
> > Linus Torvalds (torvalds@linux-foundation.org) said: 
> > > And quite frankly, even if you then _manually_ put 'relatime' in 
> > > /etc/fstab, the default Fedora install will totally ignore it. Why? 
> > > Because it mounts the root partition while using initrd, and totally 
> > > ignores /etc/fstab.
> > 
> > It should honor /etc/fstab changes, if the initramfs is rebuilt
> > after the change is made. If it doesn't, that's a bug.
> 
> Why the hell should I rebuild initramfs? 

Well, it's got to find the root fs options somewhere. Pulling them
from the modified /etc/fstab in the root fs before you mount it, well...

As for why fstab options aren't applied with remount once the root
fs has been mounted, 1) historical reasons 2) someone specifies
'data=writeback' or similar can't-be-applied-with-remount flag in
/etc/fstab, and then mount refuses to remount it at all, and the
system refuses to boot. Arguably pilot error, of course.

Bill

* Re: Linux 2.6.29
  2009-03-26 18:21                                         ` Hugh Dickins
@ 2009-03-26 18:32                                           ` Jens Axboe
  2009-03-26 19:00                                             ` Hugh Dickins
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-26 18:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, Mar 26 2009, Hugh Dickins wrote:
> On Thu, 26 Mar 2009, Jens Axboe wrote:
> > On Thu, Mar 26 2009, Hugh Dickins wrote:
> > > On Thu, 26 Mar 2009, Jens Axboe wrote:
> > > > On Wed, Mar 25 2009, Hugh Dickins wrote:
> > > > > 
> > > > > Tangential question, but am I right in thinking that BIO_RW_BARRIER
> > > > > similarly bars across all partitions, whereas its WRITE_BARRIER and
> > > > > DISCARD_BARRIER users would actually prefer it to apply to just one?
> > > > 
> > > > All the barriers refer to just that range which the barrier itself
> > > > references.
> > > 
> > > Ah, thank you: then I had a fundamental misunderstanding of them,
> > > and need to go away and work that out some more.
> > > 
> > > Though I didn't read it before asking, doesn't the I/O Barriers section
> > > of Documentation/block/biodoc.txt give a very different impression?
> > 
> > I'm sensing a miscommunication here... The ordering constraint is across
> > devices, at least that is how it is implemented. For file system
> > barriers (like BIO_RW_BARRIER), it could be per-partition instead. Doing
> > so would involve some changes at the block layer side, not necessarily
> > trivial. So I think you were asking about ordering, I was answering
> > about the write guarantee :-)
> 
> Ah, thank you again, perhaps I did understand after all.
> 
> So, directing a barrier (WRITE_BARRIER or DISCARD_BARRIER) to a range
> of sectors in one partition interposes a barrier into the queue of I/O
> across (all partitions of) that whole device.

Correct

> I think that's not how filesystems really want barriers to behave,
> and might tend to discourage us from using barriers more freely.
> But I have zero appreciation of whether it's a significant issue
> worth non-trivial change - just wanted to get it out into the open.

Per-partition definitely makes sense. The problem is that we do sorting
on a per-device basis right now. But it's a good point, I'll try and
take a look at how much work it would be to make it per-partition
instead. It won't be trivial :-)

-- 
Jens Axboe


* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 18:11                                 ` Jan Kara
@ 2009-03-26 18:35                                   ` Andrew Morton
  2009-03-27 21:26                                     ` Jan Kara
  2009-03-26 22:39                                   ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-03-26 18:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ingo Molnar, Linus Torvalds, Theodore Tso, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath


The patch looks OK to me.

On Thu, 26 Mar 2009 19:11:06 +0100 Jan Kara <jack@suse.cz> wrote:

> @@ -1490,6 +1494,16 @@ static int ext3_ordered_writepage(struct page *page,
>  	if (ext3_journal_current_handle())
>  		goto out_fail;
>  
> +	if (!page_has_buffers(page)) {
> +		create_empty_buffers(page, inode->i_sb->s_blocksize,
> +				(1 << BH_Dirty)|(1 << BH_Uptodate));

This will attach dirty buffers to a clean page, which is an invalid
state (but OK if we immediately fix it up).


> +	} else if (!walk_page_buffers(NULL, page_buffers(page), 0, PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
> +		/* Provide NULL instead of get_block so that we catch bugs if buffers weren't really mapped */
> +		return block_write_full_page(page, NULL, wbc);
> +	}
> +	page_bufs = page_buffers(page);
> +
> +
>  	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
>  
>  	if (IS_ERR(handle)) {

And if this error happens we'll go on to run
redirty_page_for_writepage() which will do the right thing.

However if PageMappedToDisk() is working right, we should be able to
avoid that newly-added buffer walk.  Possibly SetPageMappedToDisk()
isn't being run in all the right places though, dunno.



* Re: [PATCH 2/2] Make relatime default
  2009-03-26 17:53                                               ` [PATCH 2/2] Make relatime default Matthew Garrett
@ 2009-03-26 18:48                                                 ` Alan Cox
  2009-03-26 22:27                                                   ` Linus Torvalds
  2009-03-30 14:42                                                   ` Andrea Arcangeli
  0 siblings, 2 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-26 18:48 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, 26 Mar 2009 17:53:14 +0000
Matthew Garrett <mjg@redhat.com> wrote:

> Change the default behaviour of the kernel to use relatime for all
> filesystems. This can be overridden with the "strictatime" mount
> option.

NAK this again

There is an expected behaviour pattern that is standards compliant and
suddenly breaking that on people when they upgrade could cause serious
problems in some server environments.

What you propose is basically a bogus ABI change. Fix it in user space
(in fact all the distros *are* so this patch is silly and pointless)

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 17:48                                                 ` Andrew Morton
@ 2009-03-26 18:49                                                   ` Matthew Garrett
  2009-03-26 19:20                                                     ` Andrew Morton
  0 siblings, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26 18:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frans Pop, Linus Torvalds, mingo, tytso, jack, alan, arjan,
	a.p.zijlstra, npiggin, jens.axboe, drees76, jesper, linux-kernel,
	oleg, roland

On Thu, Mar 26, 2009 at 10:48:43AM -0700, Andrew Morton wrote:

> Oh, the feature itself is desirable.  But the interface isn't.
> 
> - It's a magic number.  Maybe someone runs tmpwatch twice per day, or
>   weekly, or...

There has to be a default. 24 hours is a sensible one.

> - That's fixable by making "24" tunable, but it's still a global
>   thing.  Better to make it per-fs.

Patches welcome.

> - mount(8) is the standard way of tuning fs behaviour.  There's no
>   need to deviate from that here.

Patches welcome.

When did we adopt a mindset that led to code having to satisfy every 
single user requirement before being accepted, rather than being happy 
with code that provides an incremental improvement over what exists 
already? If there are actually users who want to be able to tune this 
per filesystem then I'm sure someone (possibly even me) will write code 
to support them, but right now it just sounds like features for the sake 
of some sense of aesthetic correctness.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: [PATCH 1/2] Add a strictatime mount option
  2009-03-26 17:49                                             ` [PATCH 1/2] Add a strictatime mount option Matthew Garrett
  2009-03-26 17:53                                               ` [PATCH 2/2] Make relatime default Matthew Garrett
@ 2009-03-26 18:52                                               ` Alan Cox
  1 sibling, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-26 18:52 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, 26 Mar 2009 17:49:56 +0000
Matthew Garrett <mjg@redhat.com> wrote:

> Add support for explicitly requesting full atime updates. This makes it
> possible for kernels to default to relatime but still allow userspace to
> override it.
> 
> Signed-off-by: Matthew Garrett <mjg@redhat.com>

NAK. This is an unnecessary complication from a broken ABI change that isn't
safe to make anyway.


* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 17:32                                         ` Bill Nottingham
  2009-03-26 17:41                                           ` Linus Torvalds
@ 2009-03-26 18:54                                           ` Alan Cox
  1 sibling, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-26 18:54 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, 26 Mar 2009 13:32:38 -0400
Bill Nottingham <notting@redhat.com> wrote:

> Linus Torvalds (torvalds@linux-foundation.org) said: 
> > And quite frankly, even if you then _manually_ put 'relatime' in 
> > /etc/fstab, the default Fedora install will totally ignore it. Why? 
> > Because it mounts the root partition while using initrd, and totally 
> > ignores /etc/fstab.
> 
> It should honor /etc/fstab changes, if the initramfs is rebuilt
> after the change is made. If it doesn't, that's a bug.

Surely it should also look at the real /etc/fstab after mounting root r/o
and then flip the options needed so you don't have to.

* Re: [PATCH] Allow relatime to update atime once a day
  2009-03-26 17:32                                             ` [PATCH] Allow relatime to update atime once a day Matthew Garrett
  2009-03-26 17:56                                               ` Alexey Dobriyan
@ 2009-03-26 18:55                                               ` Alan Cox
  1 sibling, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-26 18:55 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Andrew Morton, Frans Pop, mingo, tytso, jack,
	arjan, a.p.zijlstra, npiggin, jens.axboe, drees76, jesper,
	linux-kernel, oleg, roland, willy, vaurora

On Thu, 26 Mar 2009 17:32:14 +0000
Matthew Garrett <mjg@redhat.com> wrote:

> Allow atime to be updated once per day even with relatime. This lets
> utilities like tmpreaper (which delete files based on last access time)
> continue working, making relatime a plausible default for distributions.

And while I think forcing relatime on is a really dumb dangerous idea,
providing it so you can enable it (or distro new releases can for new
installs etc) is a *very good* one

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 17:16                                           ` Linus Torvalds
  2009-03-26 17:49                                             ` [PATCH 1/2] Add a strictatime mount option Matthew Garrett
@ 2009-03-26 18:59                                             ` Alan Cox
  2009-03-26 20:02                                               ` Matthew Garrett
  2009-03-27 12:00                                               ` ext3 IO latency measurements Giacomo A. Catenazzi
  1 sibling, 2 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-26 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

> And what's the argument for not doing it in the kernel?
> 
> The fact is, "atime" by default is just wrong.

It probably was a wrong default - twenty years ago. Actually it may well
have been a wrong default in Unix v6 8)

However
- atime behaviour is SuS required
- there are users with systems out there using atime and dependent on
  proper atime

So we can't change the ABI on them any more than we can decide that next
week write() should return short values on writes to disk interrupted by
signals...

Letting distros flip to relatime means new installs and gradual migration
occurs and nobody gets spectacularly blown up when their archiving
system, their usage profiling and disk balancing tools and the like go
wrong.
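
For reference, this is the kind of retry loop every application would need
if write() to disk really did start returning short counts. It is a minimal
sketch: write_all() is an illustrative name, and the demo uses a pipe rather
than a disk file.

```python
import os


def write_all(fd, data):
    """Write all of data to fd, retrying after short writes."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write() returns the number of bytes accepted, which may be
        # fewer than requested.
        total += os.write(fd, view[total:])
    return total


if __name__ == "__main__":
    r, w = os.pipe()
    print(write_all(w, b"hello"))
    os.close(w)
    print(os.read(r, 16))
    os.close(r)
```

Plenty of existing code calls write() once and assumes the whole buffer went
out, which is why the behaviour cannot be changed underneath it.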

* Re: Linux 2.6.29
  2009-03-26 18:32                                           ` Jens Axboe
@ 2009-03-26 19:00                                             ` Hugh Dickins
  2009-03-26 19:03                                               ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Hugh Dickins @ 2009-03-26 19:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, 26 Mar 2009, Jens Axboe wrote:
> On Thu, Mar 26 2009, Hugh Dickins wrote:
> > 
> > So, directing a barrier (WRITE_BARRIER or DISCARD_BARRIER) to a range
> > of sectors in one partition interposes a barrier into the queue of I/O
> > across (all partitions of) that whole device.
> 
> Correct
> 
> > I think that's not how filesystems really want barriers to behave,
> > and might tend to discourage us from using barriers more freely.
> > But I have zero appreciation of whether it's a significant issue
> > worth non-trivial change - just wanted to get it out into the open.
> 
> Per-partition definitely makes sense. The problem is that we do sorting
> on a per-device basis right now. But it's a good point, I'll try and
> take a look at how much work it would be to make it per-partition
> instead. It won't be trivial :-)

Thanks, that would be interesting.
Trivial bores you anyway, doesn't it?

Hugh

* Re: Linux 2.6.29
  2009-03-26 19:00                                             ` Hugh Dickins
@ 2009-03-26 19:03                                               ` Jens Axboe
  0 siblings, 0 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-26 19:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, Mar 26 2009, Hugh Dickins wrote:
> On Thu, 26 Mar 2009, Jens Axboe wrote:
> > On Thu, Mar 26 2009, Hugh Dickins wrote:
> > > 
> > > So, directing a barrier (WRITE_BARRIER or DISCARD_BARRIER) to a range
> > > of sectors in one partition interposes a barrier into the queue of I/O
> > > across (all partitions of) that whole device.
> > 
> > Correct
> > 
> > > I think that's not how filesystems really want barriers to behave,
> > > and might tend to discourage us from using barriers more freely.
> > > But I have zero appreciation of whether it's a significant issue
> > > worth non-trivial change - just wanted to get it out into the open.
> > 
> > Per-partition definitely makes sense. The problem is that we do sorting
> > on a per-device basis right now. But it's a good point, I'll try and
> > take a look at how much work it would be to make it per-partition
> > instead. It won't be trivial :-)
> 
> Thanks, that would be interesting.
> Trivial bores you anyway, doesn't it?

You're a good motivator, Hugh!

-- 
Jens Axboe


* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 18:49                                                   ` Matthew Garrett
@ 2009-03-26 19:20                                                     ` Andrew Morton
  2009-03-26 19:43                                                       ` Matthew Garrett
  2009-03-27 11:25                                                       ` David Hagood
  0 siblings, 2 replies; 664+ messages in thread
From: Andrew Morton @ 2009-03-26 19:20 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Frans Pop, Linus Torvalds, mingo, tytso, jack, alan, arjan,
	a.p.zijlstra, npiggin, jens.axboe, drees76, jesper, linux-kernel,
	oleg, roland

On Thu, 26 Mar 2009 18:49:00 +0000 Matthew Garrett <mjg@redhat.com> wrote:

> On Thu, Mar 26, 2009 at 10:48:43AM -0700, Andrew Morton wrote:
> 
> > Oh, the feature itself is desirable.  But the interface isn't.
> > 
> > - It's a magic number.  Maybe someone runs tmpwatch twice per day, or
> >   weekly, or...
> 
> There has to be a default. 24 hours is a sensible one.

"default" implies that it can be altered by some means.

> When did we adopt a mindset that led to code having to satisfy every 
> single user requirement before being accepted, rather than being happy 
> with code that provides an incremental improvement over what exists 
> already?

Shortcomings have been identified.  Weaselly verbiage is not a suitable
way of addressing shortcomings!

Yes, we could (and do) merge things as a halfway step.  But when the
features are visible to userspace we just can't do that - we have to
get the interface right on day one, because interfaces are for ever.

(Bear in mind that kernel-does-X-once-per-day _is_ an interface)

> If there are actually users who want to be able to tune this 
> per filesystem then I'm sure someone (possibly even me) will write code 
> to support them, but right now it just sounds like features for the sake 
> of some sense of aesthetic correctness.

A hard-wired global 24-hours constant is in no way superior to a
per-mount tunable.  If we're going to do this we should do it in the
best way we know, and we certainly should not lock ourselves into the
inferior implementation for all time by exposing it to userspace.
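
To make the two options concrete, here is the relatime decision with the
interval as a parameter. It is a sketch, not the proposed kernel code: the
function name and the "interval" argument (standing in for a per-mount
tunable) are assumptions, and times are plain seconds.

```python
DAY = 24 * 60 * 60  # the hard-wired interval under discussion, in seconds


def relatime_needs_update(atime, mtime, ctime, now, interval=DAY):
    """Decide whether a read should write a new atime under relatime."""
    if mtime >= atime or ctime >= atime:
        # The file was modified or changed since it was last read.
        return True
    # Otherwise refresh only when the stored atime is at least `interval` old.
    return now - atime >= interval


if __name__ == "__main__":
    # Read one hour after the previous read, file unchanged in between.
    print(relatime_needs_update(atime=1000, mtime=500, ctime=500,
                                now=1000 + 3600))
    # Same read with a one-hour interval instead of the 24-hour default.
    print(relatime_needs_update(atime=1000, mtime=500, ctime=500,
                                now=1000 + 3600, interval=3600))
```

With the default the first read leaves atime alone. Someone running tmpwatch
more often than daily would need the smaller interval, which a global
constant cannot express per filesystem.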

* Re: relatime: update once per day patches (was: ext3 IO latency measurements)
  2009-03-26 19:20                                                     ` Andrew Morton
@ 2009-03-26 19:43                                                       ` Matthew Garrett
  2009-03-27 11:25                                                       ` David Hagood
  1 sibling, 0 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26 19:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frans Pop, Linus Torvalds, mingo, tytso, jack, alan, arjan,
	a.p.zijlstra, npiggin, jens.axboe, drees76, jesper, linux-kernel,
	oleg, roland

On Thu, Mar 26, 2009 at 12:20:03PM -0700, Andrew Morton wrote:
> On Thu, 26 Mar 2009 18:49:00 +0000 Matthew Garrett <mjg@redhat.com> wrote:
> > On Thu, Mar 26, 2009 at 10:48:43AM -0700, Andrew Morton wrote:
> > There has to be a default. 24 hours is a sensible one.
> 
> "default" implies that it can be altered by some means.

You need a default even in the tunable case.

> > When did we adopt a mindset that led to code having to satisfy every 
> > single user requirement before being accepted, rather than being happy 
> > with code that provides an incremental improvement over what exists 
> > already?
> 
> Shortcomings have been identified.  Weaselly verbiage is not a suitable
> way of addressing shortcomings!

What shortcomings? So far we have a hypothetical complaint that some 
users will want to choose a different value. Right now they have the 
choice of continuing to not use relatime. Things are no worse for them 
than they were previously.

> > If there are actually users who want to be able to tune this 
> > per filesystem then I'm sure someone (possibly even me) will write code 
> > to support them, but right now it just sounds like features for the sake 
> > of some sense of aesthetic correctness.
> 
> A hard-wired global 24-hours constant is in no way superior to a
> per-mount tunable.  If we're going to do this we should do it in the
> best way we know, and we certainly should not lock ourselves into the
> inferior implementation for all time by exposing it to userspace.

I don't claim that it's superior, merely that it deals with all the use 
cases I've had to worry about and so is good enough. If it turns out 
that there are people in the real world who need the better version then 
I can write that code, but I'm not going to while it's a hypothetical.
-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 18:59                                             ` ext3 IO latency measurements (was: Linux 2.6.29) Alan Cox
@ 2009-03-26 20:02                                               ` Matthew Garrett
  2009-03-26 20:42                                                 ` Alan Cox
  2009-03-27 12:00                                               ` ext3 IO latency measurements Giacomo A. Catenazzi
  1 sibling, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26 20:02 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, Mar 26, 2009 at 06:59:00PM +0000, Alan Cox wrote:
> > And what's the argument for not doing it in the kernel?
> > 
> > The fact is, "atime" by default is just wrong.
> 
> It probably was a wrong default - twenty years ago. Actually it may well
> have been a wrong default in Unix v6 8)
> 
> However
> - atime behaviour is SuS required

SuS says "An implementation may update fields that are marked for update 
immediately, or it may update such fields periodically. At an update 
point in time, any marked fields shall be set to the current time and 
the update marks shall be cleared" but doesn't appear to specify any 
kind of time limit. A conforming implementation could wait a century 
before performing the update. So while relatime doesn't conform, the 
practical difference is meaningless. You can't depend on atime being 
updated in a timely manner.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 20:02                                               ` Matthew Garrett
@ 2009-03-26 20:42                                                 ` Alan Cox
  2009-03-26 20:55                                                   ` Matthew Garrett
  2009-03-26 23:04                                                   ` Bron Gondwana
  0 siblings, 2 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-26 20:42 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

> before performing the update. So while relatime doesn't conform, the 
> practical difference is meaningless. You can't depend on atime being 
> updated in a timely manner.

POSIX says a disk write interrupted by a signal can be a short write. If
you do this in practice all hell breaks loose.

A conforming implementation needs to conform with expectations not just
play lawyer games with users' systems.

Alan

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 20:42                                                 ` Alan Cox
@ 2009-03-26 20:55                                                   ` Matthew Garrett
  2009-03-26 20:58                                                     ` Alan Cox
  2009-03-26 23:04                                                   ` Bron Gondwana
  1 sibling, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-26 20:55 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, Mar 26, 2009 at 08:42:09PM +0000, Alan Cox wrote:
> > before performing the update. So while relatime doesn't conform, the 
> > practical difference is meaningless. You can't depend on atime being 
> > updated in a timely manner.
> 
> POSIX says a disk write interrupted by a signal can be a short write. If
> you do this in practice all hell breaks loose.
> 
> A conforming implementation needs to conform with expectations not just
> play lawyer games with users systems.

I agree, but arguing for something on the basis of a spec isn't terribly 
convincing if the spec allows effectively identical behaviour. SuS isn't 
a relevant consideration when it comes to deciding default atime policy.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 20:55                                                   ` Matthew Garrett
@ 2009-03-26 20:58                                                     ` Alan Cox
  0 siblings, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-26 20:58 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

> I agree, but arguing for something on the basis of a spec isn't terribly 
> convincing if the spec allows effectively identical behaviour. SuS isn't 
> a relevant consideration when it comes to deciding default atime policy.

I'd say it's of minor relevance. The default expected behaviour we had
last week is however of major relevance and that is my big concern.

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 18:23                                             ` Bill Nottingham
@ 2009-03-26 22:24                                               ` Linus Torvalds
  2009-03-27 13:47                                                 ` Bill Nottingham
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26 22:24 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath



On Thu, 26 Mar 2009, Bill Nottingham wrote:
> 
> Well, it's got to find the root fs options somewhere. Pulling them
> from the modified /etc/fstab in the root fs before you mount it, well...

Umm.

The _only_ sane thing to do is to mount the root read-only from initramfs, 
and then re-mount it with the options in the /etc/fstab later when you 
re-mount it read-write _anyway_ (which may possibly be immediately, of 
course).

Anybody who thinks you should re-write initramfs for something like this 
really hasn't spent a single second thinking about it.

		Linus

* Re: [PATCH 2/2] Make relatime default
  2009-03-26 18:48                                                 ` Alan Cox
@ 2009-03-26 22:27                                                   ` Linus Torvalds
  2009-03-27  0:15                                                     ` Frans Pop
                                                                       ` (2 more replies)
  2009-03-30 14:42                                                   ` Andrea Arcangeli
  1 sibling, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26 22:27 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matthew Garrett, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath



On Thu, 26 Mar 2009, Alan Cox wrote:
> 
> NAK this again

And I don't care.

If the distro's had done this right in the year+ that this has been in, I 
might consider your NAK to have some weight.

As it is, we know that didn't happen, and we've had three different people 
from different distributions say that they wanted to use relatime anyway, 
so it's now the default in my git tree. 

If you want to live in some dark ages, you can do so with the 
"strictatime" thing.

			Linus

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 18:11                                 ` Jan Kara
  2009-03-26 18:35                                   ` Andrew Morton
@ 2009-03-26 22:39                                   ` Linus Torvalds
  2009-03-26 22:57                                     ` Andrew Morton
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-26 22:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ingo Molnar, Theodore Tso, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath



On Thu, 26 Mar 2009, Jan Kara wrote:
>
>   Reads are measurably better with the patch - the test with cat you
> describe below took ~0.5s per file without the patch and always less than
> 0.02s with the patch. So it seems to help something.

That would seem to be a _huge_ improvement.

Reads are the biggest issue for starting a new process (eg starting 
firefox while under load), and if cat'ing that small file improved by that 
much, then I bet there's a huge practical implication for a lot of desktop 
uses.

The fundamental fsync() latency problem we sadly can't help much with, the 
way ext3 seems to work. But I do suspect that the whole "don't synchronize 
with the journal for normal write-outs" may end up helping even fsync just 
a bit, if only because I suspect it will improve writeout throughput too 
and thus avoid one particular bottleneck.

			Linus

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 22:39                                   ` Linus Torvalds
@ 2009-03-26 22:57                                     ` Andrew Morton
  2009-03-27 21:38                                       ` Jan Kara
  0 siblings, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-03-26 22:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Ingo Molnar, Theodore Tso, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Thu, 26 Mar 2009 15:39:53 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 26 Mar 2009, Jan Kara wrote:
> >
> >   Reads are measurably better with the patch - the test with cat you
> > describe below took ~0.5s per file without the patch and always less than
> > 0.02s with the patch. So it seems to help something.
> 
> That would seem to be a _huge_ improvement.

It's strange that we still don't have an ext3_writepages().  Open a
transaction, do a large pile of writes, close the transaction again. 
We don't even have a data=writeback writepages() implementation, which
should be fairly simple.

Bizarre.

Mingming had a shot at it a few years ago and I think Badari did as
well, but I guess it didn't work out.

Falling back to generic_writepages() on our main local fs is a bit lame.

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 15:28                                     ` Theodore Tso
@ 2009-03-26 23:02                                       ` Ingo Molnar
  2009-03-26 23:59                                         ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-03-26 23:02 UTC (permalink / raw)
  To: Theodore Tso, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath


* Theodore Tso <tytso@mit.edu> wrote:

> On Thu, Mar 26, 2009 at 03:03:12PM +0100, Ingo Molnar wrote:
> > 
> > I still see similarly bad latencies in Vim:
> > 
> 
> When you say "similarly bad", how many seconds were you seeing?  I 
> understand that from the user's perspective, the 120 seconds you 
> saw with ext3 isn't going to be that different from 15 seconds 
> (which seems to be the maximum commit time in the jbd2 history 
> file you sent me), but I'm curious if what you saw was just as bad 
> with ext4, or was it somewhat better (i.e., 120 seconds vs 15 or 
> so).  Or were you also seeing a net time to save the file using 
> vim of around 120 seconds with ext4?

It was in the minute range, iirc. It was totally unusable 
interactively. I wrote the vim-test script during that workload and 
i'm still getting annoyed thinking back at the experience. Is that 
enough to consider it bad? :-)

This isnt me streaming gigs of data in and out of the system 
dirtying 90% of all RAM. This is a trivial workload barely 
scratching the RAM and CPU capabilities of the system.

Do you have a non-tweaked default Fedora install somewhere? These 
kinds of delays in Vim were easily reproducible in the last 5 years 
and i saw it reported frequently on various lists.

Have you tried to reproduce it? Have you tried CONFIG_LATENCYTOP? We 
implemented that kernel feature specifically to make it easy for 
developers to instrument their kernel and keep system latencies 
down. This isnt some oddball workload or oddball system. These 
latencies are reproducible on just about any Linux development 
system i ever tried with ext3.

And the thing is, to 99.9% of the people it doesnt matter how 
scalable we are to 16000 CPUs or whether a directory with 1 million 
files in it takes 10 or 200 msecs to parse.

But it gives a permanent impression how much delay basic everyday 
operations on the system have. So latency optimizations (and i use 
the term loosely here) have to be the primary development metric in 
Linux IMHO. ( If i were doing filesystem development i'd sure 
already have my low-latency filesystem patchset ;-)

	Ingo

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 20:42                                                 ` Alan Cox
  2009-03-26 20:55                                                   ` Matthew Garrett
@ 2009-03-26 23:04                                                   ` Bron Gondwana
  2009-03-27 11:22                                                     ` Alan Cox
  1 sibling, 1 reply; 664+ messages in thread
From: Bron Gondwana @ 2009-03-26 23:04 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matthew Garrett, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Jan Kara, Andrew Morton, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, Oleg Nesterov, Roland McGrath

On Thu, Mar 26, 2009 at 08:42:09PM +0000, Alan Cox wrote:
> > before performing the update. So while relatime doesn't conform, the 
> > practical difference is meaningless. You can't depend on atime being 
> > updated in a timely manner.
> 
> POSIX says a disk write interrupted by a signal can be a short write. If
> you do this in practice all hell breaks loose.
> 
> A conforming implementation needs to conform with expectations not just
> play lawyer games with users systems.

Is this the same Alan Cox who thought a couple of months ago that
having an insanely low default maximum number of epoll instances was a
reasonable answer to a theoretical DoS risk, despite it breaking
pretty much every reasonable user of the epoll interface?

Bron ( what stable interface? )

* Re: Linux 2.6.29
  2009-03-25 18:40                   ` Stephen Clark
@ 2009-03-26 23:53                     ` Mark Lord
  0 siblings, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-26 23:53 UTC (permalink / raw)
  To: sclark46
  Cc: Arjan van de Ven, Jesse Barnes, Theodore Tso, Ingo Molnar,
	Alan Cox, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

Stephen Clark wrote:
> Arjan van de Ven wrote:
..
>> the people that care use my kernel patch on ext3 ;-)
>> (or the userland equivalent tweak in /etc/rc.local)
>>
>>
>>
> Ok, I bite what is the userland tweak?
 
## /etc/rc.local:
# ionice -c1 moves each kjournald thread into the realtime I/O scheduling class
for i in `pidof kjournald` ; do ionice -c1 -p $i ; done

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 23:02                                       ` Ingo Molnar
@ 2009-03-26 23:59                                         ` Theodore Tso
  2009-03-27  0:08                                           ` Ingo Molnar
  2009-03-27  0:40                                           ` Jesse Barnes
  0 siblings, 2 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-26 23:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Fri, Mar 27, 2009 at 12:02:40AM +0100, Ingo Molnar wrote:
> This isnt me streaming gigs of data in and out of the system 
> dirtying 90% of all RAM. This is a trivial workload barely 
> scratching the RAM and CPU capabilities of the system.

Have you tried with maxcpus set to say, 2?  My guess is you won't see
the problems in that case.  So I'm not sure saying "barely scratching
the CPU capabilities of the system" is completely fair.  I can
probably get temporary access to a 16 CPU system, but
that's not the kind of system that I normally get to use for my kernel
development.

> Have you tried to reproduce it? Have you tried CONFIG_LATENCYTOP? We 
> implemented that kernel feature specifically to make it easy for 
> developers to instrument their kernel and keep system latencies 
> down. This isnt some oddball workload or oddball system. These 
> latencies are reproducible on just about any Linux development 
> system i ever tried with ext3.

My normal development is not all that different from yours (make
-j<numcpus*2>) and I do edit and save files while the compile is
going.  I use emacs, but it calls fsync() when saving files, just like
vim does.  The big difference is that for me, numcpus is normally 2.
And my machine has 4 gigs of memory, not 12 gigs.  So I don't see
these problems.  I agree that what you have isn't an "oddball
workload"; as far as whether it is an "oddball system", it is
certainly a system I would lust after.  And I acknowledge the world is
a bit different from when Linus declared that 99% of the world was 1
or 2 CPU's.  I suspect the percentage of machines with 16 CPU's is
still somewhat small, though.

So I'll try to reproduce it on a 16 CPU system when I have a chance
--- but I'll have to borrow such a system, or get remote access to
one, before I can play with it.  Clearly your employer is
way more generous with equipment than mine is, at least for personal
development machines.  :-)

In the meantime, if you could run some of the tests and vary some of
the variables I requested, I'd appreciate it, and thank you for your
help.  Otherwise, I'll try to run them when I get remote access to
such a machine where I'm allowed to replace kernels and mount random
test filesystems.

						- Ted

P.S.  Another interesting test would be to plot the vim save latencies
versus the number of CPU's enabled when running the kernel build
workload.
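
A rough way to collect such numbers is to time a small write followed by fsync(), which is what an editor save amounts to. The sketch below is an assumption-laden stand-in, not the vim-test script mentioned earlier in the thread (that script is not shown here); `timed_save` and `sample_latencies` are hypothetical names.

```python
import os
import time


def timed_save(path, payload=b"x" * 4096):
    """Write payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # the call that waits for the data to reach the disk
    finally:
        os.close(fd)
    return time.monotonic() - start


def sample_latencies(path, rounds=10, pause=0.0):
    """Repeat timed_save and collect one latency sample per round."""
    samples = []
    for _ in range(rounds):
        samples.append(timed_save(path))
        time.sleep(pause)
    return samples
```

Running `sample_latencies` on the filesystem under test while the kernel build is going, once per CPU count, would give the data points for the plot.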

P.P.S.  I assume there's no way you could give me remote ssh access to
your nice 16-way machine?  :-)

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 23:59                                         ` Theodore Tso
@ 2009-03-27  0:08                                           ` Ingo Molnar
  2009-03-27  0:40                                           ` Jesse Barnes
  1 sibling, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-27  0:08 UTC (permalink / raw)
  To: Theodore Tso, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath


* Theodore Tso <tytso@mit.edu> wrote:

> On Fri, Mar 27, 2009 at 12:02:40AM +0100, Ingo Molnar wrote:
> > This isnt me streaming gigs of data in and out of the system 
> > dirtying 90% of all RAM. This is a trivial workload barely 
> > scratching the RAM and CPU capabilities of the system.
> 
> Have you tried with maxcpus set to say, 2?  My guess is you won't see
> the problems in that case.

Note, my previous devel box was a single socket quad and it had such 
delays all the time as well. Havent tried it on a dual-core. (i dont 
have dual-core systems with enough RAM to be able to build a kernel 
purely in RAM)

	Ingo

* Re: Linux 2.6.29
  2009-03-25 23:23               ` Linus Torvalds
  2009-03-25 23:46                 ` Bron Gondwana
@ 2009-03-27  0:11                 ` Andrew Morton
  2009-03-27  0:27                   ` Linus Torvalds
  2009-03-27  9:58                   ` Alan Cox
  1 sibling, 2 replies; 664+ messages in thread
From: Andrew Morton @ 2009-03-27  0:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, 25 Mar 2009 16:23:08 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Wed, 25 Mar 2009, Theodore Tso wrote:
> > > 
> > > The problem being that unlike the ratio, there's no sane default value 
> > > that you can at least argue is not _entirely_ pointless.
> > 
> > Well, if the maximum time that someone wants to wait for an fsync() to
> > return is one second, and the RAID array can write 100MB/sec
> 
> How are you going to tell the kernel that the RAID array can write 
> 100MB/s?
> 
> The kernel has no idea.
> 

userspace can do it quite easily.  Run a self-tuning script after
installation and when the disk hardware changes significantly.

It is very disappointing that nobody appears to have attempted to do
_any_ sensible tuning of these controls in all this time - we just keep
thrashing around trying to pick better magic numbers in the base kernel. 

Maybe we should set the tunables to 99.9% to make it suck enough to
motivate someone.

* Re: [PATCH 2/2] Make relatime default
  2009-03-26 22:27                                                   ` Linus Torvalds
@ 2009-03-27  0:15                                                     ` Frans Pop
  2009-03-27  0:30                                                       ` Linus Torvalds
  2009-03-27  2:05                                                     ` Frans Pop
  2009-04-09 20:13                                                     ` Pavel Machek
  2 siblings, 1 reply; 664+ messages in thread
From: Frans Pop @ 2009-03-27  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: alan, mjg, tytso, mingo, jack, akpm, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland

Linus Torvalds wrote:
> If the distro's had done this right in the year+ that this has been in,
> I might consider your NAK to have some weight.
> 
> As it is, we know that didn't happen, and we've had three different
> people from different distributions say that they wanted to use relatime
> anyway, so it's now the default in my git tree.
> 
> If you want to live in some dark ages, you can do so with the
> "strictatime" thing.

I obviously welcome the inclusion of the first patch and I'm neutral about 
the second one, but I'm not at all sure that making relatime the default 
(yet) is the right thing, especially as util-linux' mount command does 
not yet even support that "strictatime" thing. Shouldn't that at least 
happen first (and have been supported in distro's stable releases for 
some time)?

I guess users and distros can still elect not to set it as default, but it 
still seems a bit like going from one extreme to another.

* Re: Linux 2.6.29
  2009-03-27  0:11                 ` Andrew Morton
@ 2009-03-27  0:27                   ` Linus Torvalds
  2009-03-27  0:47                     ` Andrew Morton
  2009-03-27  0:51                     ` Linus Torvalds
  2009-03-27  9:58                   ` Alan Cox
  1 sibling, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27  0:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List



On Thu, 26 Mar 2009, Andrew Morton wrote:
> 
> userspace can do it quite easily.  Run a self-tuning script after
> installation and when the disk hardware changes significantly.

Uhhuh.

"user space can do it".

That's the global cop-out.

The fact is, user-space isn't doing it, and never has done anything even 
_remotely_ like it. 

In fact, I claim that it's impossible to do. If you give me a number for 
the throughput of your harddisk, I will laugh in your face and call you a 
moron.

Why? Because no such number exists. It depends on the access patterns. If 
you write one large file, the number will be very different (and not just 
by a few percent) from the numbers for writing thousands of small 
files, or re-writing a large database in random order.

So no. User space CAN NOT DO IT, and the fact that you even claim 
something like that shows a distinct lack of thought.
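
The dependence on access pattern is easy to observe. The sketch below writes the same number of bytes once as a single large file and once as many small fsync'ed files, and reports bytes per second for each. It is illustrative only: the function names and the 4096-byte file size are assumptions, and the measured ratio depends entirely on the disk and filesystem underneath.

```python
import os
import time


def one_large_file(directory, total_bytes):
    """Write total_bytes as a single file; return throughput in bytes/second."""
    start = time.monotonic()
    with open(os.path.join(directory, "large.bin"), "wb") as f:
        f.write(b"\0" * total_bytes)
        f.flush()
        os.fsync(f.fileno())
    return total_bytes / max(time.monotonic() - start, 1e-9)


def many_small_files(directory, total_bytes, file_size=4096):
    """Write the same volume as many small fsync'ed files; return bytes/second."""
    count = total_bytes // file_size
    chunk = b"\0" * file_size
    start = time.monotonic()
    for i in range(count):
        with open(os.path.join(directory, f"small-{i}.bin"), "wb") as f:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
    return count * file_size / max(time.monotonic() - start, 1e-9)
```

Neither result is "the" throughput of the disk; each is the throughput for one particular pattern.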

> Maybe we should set the tunables to 99.9% to make it suck enough to
> motivate someone.

The only times tunables have worked for us is when they auto-tune. 

IOW, we don't have "use 35% of memory for buffer cache" tunables, we just 
dynamically auto-tune memory use. And no, we don't expect user space to 
run some "tuning program for their load" either.

		Linus

* Re: [PATCH 2/2] Make relatime default
  2009-03-27  0:15                                                     ` Frans Pop
@ 2009-03-27  0:30                                                       ` Linus Torvalds
  2009-03-27 14:06                                                         ` Alan Cox
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27  0:30 UTC (permalink / raw)
  To: Frans Pop
  Cc: alan, mjg, tytso, mingo, jack, akpm, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland



On Fri, 27 Mar 2009, Frans Pop wrote:
> 
> I guess users and distros can still elect not to set it as default, but it 
> still seems a bit like going from one extreme to another.

Why? RELATIME has been around since 2006 now. Nothing has happened. People 
who think "we should leave it up to user land" lost their credibility long 
ago.

			Linus

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 23:59                                         ` Theodore Tso
  2009-03-27  0:08                                           ` Ingo Molnar
@ 2009-03-27  0:40                                           ` Jesse Barnes
  1 sibling, 0 replies; 664+ messages in thread
From: Jesse Barnes @ 2009-03-27  0:40 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu, 26 Mar 2009 19:59:36 -0400
Theodore Tso <tytso@mit.edu> wrote:

> On Fri, Mar 27, 2009 at 12:02:40AM +0100, Ingo Molnar wrote:
> > This isnt me streaming gigs of data in and out of the system 
> > dirtying 90% of all RAM. This is a trivial workload barely 
> > scratching the RAM and CPU capabilities of the system.
> 
> Have you tried with maxcpus set to say, 2?  My guess is you won't see
> the problems in that case.  So I'm not sure saying "barely scratching
> the CPU capabilities of the system" is completely fair.  I can
> probably get temporary access to a 16 CPU system, but
> that's not the kind of system that I normally get to use for my kernel
> development.

Nope, I saw this with my dual CPU machine too (before I upgraded to
quad core)... Just doing kernel builds and/or icecream and/or VMware.
It didn't take much.  I have 8G of memory now but I used to have less
(3G iirc) and saw it then too.

> My normal development is not all that different from yours (make
> -j<numcpus*2>) and I do edit and save files while the compile is
> going.  I use emacs, but it calls fsync() when saving files, just like
> vim does.  The big difference is that for me, numcpus is normally 2.
> And my machine has 4 gigs of memory, not 12 gigs.  So I don't see
> these problems.  I agree that what you have isn't an "oddball
> workload"; as far as whether it is an "oddball system", it is
> certainly a system I would lust after.  And I acknowledge the world is
> a bit different from when Linus declared that 99% of the world was 1
> or 2 CPU's.  I suspect the percentage of machines with 16 CPU's is
> still somewhat small, though.

I'm surprised you haven't seen this then... Maybe your journal is
bigger?  Or some other config difference...

-- 
Jesse Barnes, Intel Open Source Technology Center

* Re: Linux 2.6.29
  2009-03-27  0:27                   ` Linus Torvalds
@ 2009-03-27  0:47                     ` Andrew Morton
  2009-03-27  1:03                       ` Linus Torvalds
  2009-03-27  0:51                     ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-03-27  0:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Thu, 26 Mar 2009 17:27:43 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 26 Mar 2009, Andrew Morton wrote:
> > 
> > userspace can do it quite easily.  Run a self-tuning script after
> > installation and when the disk hardware changes significantly.
> 
> Uhhuh.
> 
> "user space can do it".
> 
> That's the global cop-out.

userspace can get closer than the kernel can.

> The fact is, user-space isn't doing it, and never has done anything even 
> _remotely_ like it. 
> 
> In fact, I claim that it's impossible to do. If you give me a number for 
> the throughput of your harddisk, I will laugh in your face and call you a 
> moron.
>
> Why? Because no such number exists. It depends on the access patterns.

Those access patterns are observable!

> If 
> you write one large file, the number will be very different (and not just 
> by a few percent) from the numbers for writing thousands of small 
> files, or re-writing a large database in random order.
> 
> So no. User space CAN NOT DO IT, and the fact that you even claim 
> something like that shows a distinct lack of thought.

userspace can get closer.  Even if it's asking the user "what sort of
applications will this machine be running" and then use a set of canned
tunables based on that.

Better would be to observe system behaviour, perhaps in real time and
make adjustments.

> > Maybe we should set the tunables to 99.9% to make it suck enough to
> > motivate someone.
> 
> The only times tunables have worked for us is when they auto-tune. 
> 
> IOW, we don't have "use 35% of memory for buffer cache" tunables, we just 
> dynamically auto-tune memory use. And no, we don't expect user space to 
> run some "tuning program for their load" either.
> 

This particular case is exceptional - it's just too hard for the kernel
to be able to predict the future for this one.

It wouldn't be terribly hard for a userspace daemon to produce better
results than we can achieve in-kernel.  That might of course require
additional kernel work to support it well.

* Re: Linux 2.6.29
  2009-03-27  0:27                   ` Linus Torvalds
  2009-03-27  0:47                     ` Andrew Morton
@ 2009-03-27  0:51                     ` Linus Torvalds
  2009-03-27  1:03                       ` Andrew Morton
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27  0:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List



On Thu, 26 Mar 2009, Linus Torvalds wrote:
> 
> The only times tunables have worked for us is when they auto-tune. 
> 
> IOW, we don't have "use 35% of memory for buffer cache" tunables, we just 
> dynamically auto-tune memory use. And no, we don't expect user space to 
> run some "tuning program for their load" either.

IOW, what we could reasonably do is something along the lines of:

 - start off with some reasonable value for max background dirty (per 
   block device) that defaults to something sane (quite possibly based on 
   simply memory size).

 - assume that "foreground dirty" is just always 2* background dirty.

 - if we hit the "max foreground dirty" during memory allocation, then we 
   shrink the background dirty value (logic: we never want to have to wait 
   synchronously)

 - if we hit some maximum latency on writeback, shrink dirty aggressively 
   and based on how long the latency was (because at that point we have a 
   real _measure_ of how costly it is with that load).

 - if we start doing background dirtying, but never hit the foreground 
   dirty even in dirty balancing (ie when a writer is actually _writing_, 
   as opposed to hitting it when allocating memory by a non-writer), then 
   slowly open up the window - we may be limiting too early.

.. add heuristics to taste. The point being, that if we do this based on 
real loads, and based on hitting the real problems, then we might actually 
be getting somewhere. In particular, if the filesystem sucks at writeout 
(ie the limiter is not the _disk_, but the filesystem serialization), then 
it should automatically also shrink the max dirty state. 

The tunable then could become the maximum latency we accept or something 
like that. Or the hysteresis limits/rules for the soft "grow" or "shrink" 
events. At that point, maybe we could even find something that works for 
most people.
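
The rules above can be sketched as a toy controller. This is a user-space model for illustration only, not kernel code: the class name, method names, and every constant (the 5% starting point, the 0.75 shrink factor, the 5% growth step, the floor and ceiling) are made up for the example.

```python
class DirtyLimiter:
    """Toy model of the auto-tuning rules listed above; all constants are arbitrary."""

    def __init__(self, memory_bytes, max_latency=1.0):
        self.floor = 1 << 20               # never shrink below 1 MiB
        self.ceiling = memory_bytes // 2   # never grow past half of memory
        # Start from a value based simply on memory size.
        self.background = self._clamp(memory_bytes // 20)
        self.max_latency = max_latency     # the one tunable: accepted latency in seconds

    @property
    def foreground(self):
        # "foreground dirty" is always 2 * background dirty.
        return 2 * self.background

    def _clamp(self, value):
        return max(self.floor, min(self.ceiling, int(value)))

    def hit_foreground_in_allocation(self):
        # Someone had to wait synchronously during allocation: shrink.
        self.background = self._clamp(self.background * 0.75)

    def writeback_latency(self, seconds):
        # Shrink aggressively, scaled by how far the latency overshot.
        if seconds > self.max_latency:
            self.background = self._clamp(
                self.background * self.max_latency / seconds
            )

    def background_only_round(self):
        # Background writeback ran but writers never hit the foreground
        # limit: slowly open the window again.
        self.background = self._clamp(self.background * 1.05)
```

Feeding the controller measured events rather than a configured disk speed is the point of the design: a filesystem that serializes badly produces long writeback latencies and shrinks the limit on its own.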

			Linus

* Re: Linux 2.6.29
  2009-03-27  0:51                     ` Linus Torvalds
@ 2009-03-27  1:03                       ` Andrew Morton
  0 siblings, 0 replies; 664+ messages in thread
From: Andrew Morton @ 2009-03-27  1:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Thu, 26 Mar 2009 17:51:44 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 26 Mar 2009, Linus Torvalds wrote:
> > 
> > The only times tunables have worked for us is when they auto-tune. 
> > 
> > IOW, we don't have "use 35% of memory for buffer cache" tunables, we just 
> > dynamically auto-tune memory use. And no, we don't expect user space to 
> > run some "tuning program for their load" either.
> 
> IOW, what we could reasonably do is something along the lines of:
> 
>  - start off with some reasonable value for max background dirty (per 
>    block device) that defaults to something sane (quite possibly based on 
>    simply memory size).
> 
>  - assume that "foreground dirty" is just always 2* background dirty.
> 
>  - if we hit the "max foreground dirty" during memory allocation, then we 
>    shrink the background dirty value (logic: we never want to have to wait 
>    synchronously)
> 
>  - if we hit some maximum latency on writeback, shrink dirty aggressively 
>    and based on how long the latency was (because at that point we have a 
>    real _measure_ of how costly it is with that load).
> 
>  - if we start doing background dirtying, but never hit the foreground 
>    dirty even in dirty balancing (ie when a writer is actually _writing_, 
>    as opposed to hitting it when allocating memory by a non-writer), then 
>    slowly open up the window - we may be limiting too early.
> 
> .. add heuristics to taste. The point being, that if we do this based on 
> real loads, and based on hitting the real problems, then we might actually 
> be getting somewhere. In particular, if the filesystem sucks at writeout 
> (ie the limiter is not the _disk_, but the filesystem serialization), then 
> it should automatically also shrink the max dirty state. 
> 
> The tunable then could become the maximum latency we accept or something 
> like that. Or the hysteresis limits/rules for the soft "grow" or "shrink" 
> events. At that point, maybe we could even find something that works for 
> most people.
> 

hm.

It may not be too hard to account for seekiness.  Simplest case: if we
dirty a page and that page is file-contiguous to another already dirty
page then don't increment the dirty page count by "1": increment it by
0.01.

Another simple case would be to keep track of the _number_ of dirty
inodes rather than simply lumping all dirty pages together.

And then there's metadata.  The dirty balancing code doesn't account
for dirty inodes _at all_ at present.

(Many years ago there was a bug wherein we could have zillions of dirty
inodes and exactly zero dirty pages, and the writeback code wouldn't
trigger at all - the inodes would just sit there until a page got
dirtied - this might still be there).


Then again, perhaps we don't need all those discrete heuristic things. 
Maybe it can all be done in mark_buffer_dirty().  Do some clever
math+data-structure to track the seekiness of our dirtiness.  Delayed
allocation would mess that up though.
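
A toy model of the "simplest case" above, plus the dirty-inode count.
This is only a sketch under stated assumptions: the class name is
hypothetical, "file-contiguous" is taken to mean "adjacent page index in
the same inode", and the bookkeeping is a plain dict rather than
anything resembling the real dirty accounting.

```python
class SeekAwareDirtyCounter:
    """Weight each newly dirtied page by whether it extends a dirty run."""

    CONTIGUOUS_COST = 0.01   # the figure suggested in the message above
    ISOLATED_COST = 1.0      # an ordinary, potentially seeky dirty page

    def __init__(self):
        self.dirty = {}       # inode number -> set of dirty page indexes
        self.weighted = 0.0   # seek-weighted dirty page count

    def mark_dirty(self, inode, index):
        pages = self.dirty.setdefault(inode, set())
        if index in pages:
            return            # already dirty, nothing new to account
        contiguous = (index - 1) in pages or (index + 1) in pages
        self.weighted += (self.CONTIGUOUS_COST if contiguous
                          else self.ISOLATED_COST)
        pages.add(index)

    def dirty_inodes(self):
        # The "number of dirty inodes" rather than one lump of pages.
        return len(self.dirty)
```

A streaming write of 100 pages then counts as roughly 1.99 "pages",
while three scattered pages count as 3, which is the asymmetry the
suggestion is after.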


* Re: Linux 2.6.29
  2009-03-27  0:47                     ` Andrew Morton
@ 2009-03-27  1:03                       ` Linus Torvalds
  2009-03-27  1:25                         ` Andrew Morton
  2009-03-27  3:23                         ` Theodore Tso
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27  1:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List



On Thu, 26 Mar 2009, Andrew Morton wrote:
> 
> userspace can get closer than the kernel can.

Andrew, that's SIMPLY NOT TRUE.

You state that without any amount of data to back it up, as if it was some 
kind of truism. It's not.

> > Why? Because no such number exists. It depends on the access patterns.
> 
> Those access patterns are observable!

Not by user space they aren't, and not dynamically. At least not as well 
as they are for the kernel.

So when you say "user space can do it better", you base that statement on 
exactly what? The night-time whisperings of the small creatures living in 
your basement?

The fact is, user space can't do better. And perhaps equally importantly, 
we have 16 years of history with user space tuning, and that history tells 
us unequivocally that user space never does anything like this.

Name _one_ case where even simple tuning has happened, and where it has 
actually _worked_?

I claim you cannot. And I have counter-examples. Just look at the utter 
fiasco that was user-space "tuning" of nice-levels that distros did. Ooh. 
Yeah, it didn't work so well, did it? Especially not when the kernel 
changed subtly, and the "tuning" that had been done was shown to be 
utter crap.

> > dynamically auto-tune memory use. And no, we don't expect user space to 
> > run some "tuning program for their load" either.
> > 
> 
> This particular case is exceptional - it's just too hard for the kernel
> to be able to predict the future for this one.

We've never even tried. 

The dirty limit was never about trying to tune things, it started out as 
protection against deadlocks and other catastrophic failures. We used to 
allow 50% dirty or something like that (which is not unlike our old buffer 
cache limits, btw), and then when we had a HIGHMEM lockup issue it got 
severely cut down. At no point was that number even _trying_ to limit 
latency, other than as a "hey, it's probably good to not have all memory 
tied up in dirty pages" kind of secondary way.

I claim that the whole balancing between inodes/dentries/pagecache/swap/ 
anonymous memory/what-not is likely a much harder problem. And no, I'm not 
claiming that we "solved" that problem, but we've clearly done a pretty 
good job over the years of getting to a reasonable end result.

Sure, you can still tune "swappiness" (nobody much does), but even there 
you don't actually tune how much memory you use for swap cache, you do 
more of a "meta-tuning" where you tune how the auto-tuning works.

That is something we have shown to work historically.

That said, the real problem isn't even the tuning. The real problem is a 
filesystem issue. If "fsync()" cost was roughly proportional to the size 
of the changes to the file we are fsync'ing, nobody would even complain. 

Everybody accepts that if you've written a 20MB file and then call 
"fsync()" on it, it's going to take a while. But when you've written a 2kB 
file, and "fsync()" takes 20 seconds, because somebody else is just 
writing normally, _that_ is a bug. And it is actually almost totally 
unrelated to the whole 'dirty_limit' thing.

At least it _should_ be. 

			Linus
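
One way to observe the small-file case described above from user space
is to time fsync() on a freshly written 2kB file, ideally while another
process is streaming writes to the same filesystem. A minimal sketch;
the function name is hypothetical and the numbers it returns depend
entirely on the filesystem, mount options and concurrent load.

```python
import os
import tempfile
import time


def timed_fsync_write(directory, size):
    """Write `size` bytes to a new file in `directory`; return the
    number of seconds spent inside fsync()."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"x" * size)
            f.flush()                  # push Python's buffer to the kernel
            start = time.monotonic()
            os.fsync(f.fileno())       # wait for the data to reach storage
            return time.monotonic() - start
    finally:
        os.unlink(path)                # clean up the scratch file
```

Calling timed_fsync_write(".", 2048) in a loop, with and without a
large concurrent writer, shows whether the cost follows the 2kB being
synced or the unrelated dirty data.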

* Re: Linux 2.6.29
  2009-03-27  1:03                       ` Linus Torvalds
@ 2009-03-27  1:25                         ` Andrew Morton
  2009-03-27  2:21                           ` David Rees
                                             ` (4 more replies)
  2009-03-27  3:23                         ` Theodore Tso
  1 sibling, 5 replies; 664+ messages in thread
From: Andrew Morton @ 2009-03-27  1:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Thu, 26 Mar 2009 18:03:15 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 26 Mar 2009, Andrew Morton wrote:
> > 
> > userspace can get closer than the kernel can.
> 
> Andrew, that's SIMPLY NOT TRUE.
> 
> You state that without any amount of data to back it up, as if it was some 
> kind of truism. It's not.

I've seen you repeatedly fiddle the in-kernel defaults based on
in-field experience.  That could just as easily have been done in
initscripts by distros, and much more effectively because it doesn't
need a new kernel.  That's data.

The fact that this hasn't even been _attempted_ (afaik) is deplorable.

Why does everyone just sit around waiting for the kernel to put a new
value into two magic numbers which userspace scripts could have set?

My /etc/rc.local has been tweaking dirty_ratio, dirty_background_ratio
and swappiness for many years.  I guess I'm just incredibly advanced.

> Everybody accepts that if you've written a 20MB file and then call 
> "fsync()" on it, it's going to take a while. But when you've written a 2kB 
> file, and "fsync()" takes 20 seconds, because somebody else is just 
> writing normally, _that_ is a bug. And it is actually almost totally 
> unrelated to the whole 'dirty_limit' thing.
> 
> At least it _should_ be. 

That's different.  It's inherent JBD/ext3-ordered brain damage. 
Unfixable without turning the fs into something which just isn't jbd/ext3
any more.  data=writeback is a workaround, with the obvious integrity
issues.

The JBD journal is a massive designed-in contention point.  It's why
for several years I've been telling anyone who will listen that we need
a new fs.  Hopefully our response to all these problems will soon be
"did you try btrfs?".
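
For readers who want to see what their own system currently uses, the
three knobs named above live under /proc/sys/vm. A minimal read-only
sketch; the function name and the `base` parameter are hypothetical and
exist only so the code can be pointed at a test directory. Changing the
values requires root and is deliberately not shown.

```python
from pathlib import Path

VM_KNOBS = ("dirty_ratio", "dirty_background_ratio", "swappiness")


def read_vm_knobs(base="/proc/sys/vm"):
    """Return {knob: int value}, or None for a knob that cannot be read."""
    values = {}
    for name in VM_KNOBS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None    # missing file, no permission, odd content
    return values
```

Running read_vm_knobs() on two machines from different distros is a
quick way to see how far the shipped defaults have drifted apart.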


* Re: [PATCH 2/2] Make relatime default
  2009-03-26 22:27                                                   ` Linus Torvalds
  2009-03-27  0:15                                                     ` Frans Pop
@ 2009-03-27  2:05                                                     ` Frans Pop
  2009-04-09 20:13                                                     ` Pavel Machek
  2 siblings, 0 replies; 664+ messages in thread
From: Frans Pop @ 2009-03-27  2:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: alan, mjg, tytso, mingo, jack, akpm, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland

On Friday 27 March 2009, you wrote:
> On Fri, 27 Mar 2009, Frans Pop wrote:
> > I guess users and distros can still elect not to set it as default,
> > but it still seems a bit like going from one extreme to another.
>
> Why? RELATIME has been around since 2006 now. Nothing has happened.
> People who think "we should leave it up to user land" lost their
> credibility long ago.

As I think Andrew already noted, the discussion today is largely a rehash 
of one in 2007, summarized by lwn [1] and kerneltrap [2]. That's also 
when Ingo first submitted the patch (based on a suggestion from you). But 
it has been blocked by others twice, and for exactly the same reasons.

relatime *without* the 24-hour safeguard has unanimously been deemed 
unsuitable as a default by distros.
So the real problem is that nobody ever did the work needed to make Ingo's 
original patch acceptable to the fs devs, which has left a stalemate for 
the last 1 3/4 years. IMO that's mainly a kernel community failure and 
not a user land failure. You've now at least broken that stalemate.

Your statement is also not quite true. At least Ubuntu has had relatime 
enabled by default for new installations for a couple of releases. And 
AFAICT they now even have it enabled by default in their kernel 
config, but I'm not entirely sure.

For Debian Lenny (current stable release), relatime is a mount option that 
can be activated during new installs (admittedly only if you look hard 
enough). All that would have been needed for Debian to enable relatime by 
default for new installs was to have something like the *first* patch of 
the three you've now committed to have been included in 2.6.26.
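
The difference the 24-hour safeguard makes can be written down as a
small decision rule. This is a sketch of the rule as commonly
described, not the kernel's code: the function name and the `safeguard`
flag are hypothetical, and timestamps are plain seconds.

```python
DAY = 24 * 60 * 60   # seconds


def relatime_should_update(atime, mtime, ctime, now, safeguard=True):
    """Decide whether a read should update the file's atime."""
    # Plain relatime: only update when the stored atime is not newer
    # than the last modification or status change.
    if mtime >= atime or ctime >= atime:
        return True
    # Safeguard: also update when the stored atime is over a day old,
    # so tools that look for "not accessed recently" still work.
    return safeguard and (now - atime >= DAY)
```

Without the safeguard, a file that is read daily but never modified
keeps its original atime forever, which is the behaviour distros
objected to.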

Cheers,
FJP

[1] http://lwn.net/Articles/244829/
[2] http://kerneltrap.org/node/14148

* Re: Linux 2.6.29
  2009-03-27  1:25                         ` Andrew Morton
@ 2009-03-27  2:21                           ` David Rees
  2009-03-27  3:03                             ` Matthew Garrett
  2009-03-27  3:36                             ` Dave Jones
  2009-03-27  3:01                           ` Matthew Garrett
                                             ` (3 subsequent siblings)
  4 siblings, 2 replies; 664+ messages in thread
From: David Rees @ 2009-03-27  2:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Theodore Tso, Jesper Krogh, Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 6:25 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> Why does everyone just sit around waiting for the kernel to put a new
> value into two magic numbers which userspace scripts could have set?
>
> My /etc/rc.local has been tweaking dirty_ratio, dirty_background_ratio
> and swappiness for many years.  I guess I'm just incredibly advanced.

The only people who bother to tune those values are people who get
annoyed enough to do the research to see if it's something that's
tunable - hackers.

Everyone else simply says "man, Linux *sucks*" and lives life hoping
it will get better some day.  From posts in this thread - even most
developers just live with it, and have been doing so for *years*.

Even Linux distros don't bother modifying init scripts - they patch
them into the kernel instead.  I routinely watch Fedora kernel changelogs
and found these comments in the changelog recently:

* Mon Mar 23 2009 xx <xx@xx.xx> 2.6.29-2
 - Change default swappiness setting from 60 to 30.

* Thu Mar 19 2009 xx <xx@xx.xx> 2.6.29-0.66.rc8.git4
 - Raise default vm dirty data limits from 5/10 to 10/20 percent.

Why are they going in the kernel package instead of /etc/sysctl.conf?
Why is Fedora deviating from upstream? (probably sqlite performance)
Maybe there's a good reason to put them into the kernel - for some
reason the latest kernels perform better with those values where the
previous ones didn't.  But still - why ship those 2 bytes of
configuration in a 75MB package instead of one that could be a
fraction of that size?

Does *any* distro fiddle those bits in userspace instead of patching the kernel?

-Dave

* Re: [PATCH v2] issue storage device flush via sync_blockdev()
  2009-03-26  3:24                                   ` [PATCH v2] issue storage device flush via sync_blockdev() Jeff Garzik
@ 2009-03-27  2:50                                     ` Theodore Tso
  2009-03-27  3:17                                       ` Jeff Garzik
  2009-03-27 20:50                                     ` [PATCH] issue storage dev flush from generic file_fsync helper Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-27  2:50 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Jeff,

FYI, I tried your patch; it causes the lvm process called out of
the initramfs from an Ubuntu 8.10 system to blow up while trying to
set up the root filesystem.  The stack trace was:

generic_make_request+0x2a3/0x2e6
trace_hardirqs_on_caller+0x111/0x135
mempool_alloc_slab+0xe/0x10
mempool_alloc+0x42/0xe0
submit_bio+0xad/0xb5
bio_alloc_bioset+0x21/0xfc
blkdev_issue_flush+0x7f/0xfc
sync_blockdev+0x2a/0x36
__blkdev_put+0x44/0x131
blkdev_put+0xa/0xc
blkdev_close+0x2e/0x32
__fput+0xcf/0x15f
fput+0x19/0x1b
filp_close+0x51/0x5b
sys_close+0x73/0xad

						- Ted

* Re: Linux 2.6.29
  2009-03-27  1:25                         ` Andrew Morton
  2009-03-27  2:21                           ` David Rees
@ 2009-03-27  3:01                           ` Matthew Garrett
  2009-03-27  3:38                           ` Linus Torvalds
                                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27  3:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Theodore Tso, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 06:25:19PM -0700, Andrew Morton wrote:
> On Thu, 26 Mar 2009 18:03:15 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > You state that without any amount of data to back it up, as if it was some 
> > kind of truism. It's not.
> 
> I've seen you repeatedly fiddle the in-kernel defaults based on
> in-field experience.  That could just as easily have been done in
> initscripts by distros, and much more effectively because it doesn't
> need a new kernel.  That's data.

If there's a sensible default then it belongs in the kernel. Forcing 
these decisions out to userspace just means that every distribution 
needs to work out what these settings are, and the evidence we've seen 
when they attempt to do this is that we end up with things like broken 
cpufreq parameters because these are difficult problems. The simple 
reality is that almost every single distribution lacks developers with 
sufficient understanding of the problem to make the correct choice.

The typical distribution lifecycle is significantly longer than a kernel 
release cycle. It's massively easier for people to pull updated kernels.

> Why does everyone just sit around waiting for the kernel to put a new
> value into two magic numbers which userspace scripts could have set?

If the distribution can set a globally correct value then that globally 
correct value should be there in the first place!

> My /etc/rc.local has been tweaking dirty_ratio, dirty_background_ratio
> and swappiness for many years.  I guess I'm just incredibly advanced.

And how have you got these values pushed into other distributions? Is 
your rc.local available anywhere?

Linus is absolutely right here. Pushing these decisions out to userspace 
means duplicated work in the best case - in the worst case it means most 
users end up with the wrong value.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: Linux 2.6.29
  2009-03-27  2:21                           ` David Rees
@ 2009-03-27  3:03                             ` Matthew Garrett
  2009-03-27  3:36                             ` Dave Jones
  1 sibling, 0 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27  3:03 UTC (permalink / raw)
  To: David Rees
  Cc: Andrew Morton, Linus Torvalds, Theodore Tso, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 07:21:08PM -0700, David Rees wrote:

> Does *any* distro fiddle those bits in userspace instead of patching the kernel?

Given that the optimal values of these tunables often seem to vary 
between kernel versions, it's easier to just put them in the kernel. 

-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: [PATCH v2] issue storage device flush via sync_blockdev()
  2009-03-27  2:50                                     ` Theodore Tso
@ 2009-03-27  3:17                                       ` Jeff Garzik
  2009-03-27  3:30                                         ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-27  3:17 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Linus Torvalds, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Theodore Tso wrote:
> Jeff,
> 
> FYI, I tried your patch; it causes the lvm process called out of
> the initramfs from an Ubuntu 8.10 system to blow up while trying to
> set up the root filesystem.  The stack trace was:
> 
> generic_make_request+0x2a3/0x2e6
> trace_hardirqs_on_caller+0x111/0x135
> mempool_alloc_slab+0xe/0x10
> mempool_alloc+0x42/0xe0
> submit_bio+0xad/0xb5
> bio_alloc_bioset+0x21/0xfc
> blkdev_issue_flush+0x7f/0xfc
> sync_blockdev+0x2a/0x36
> __blkdev_put+0x44/0x131
> blkdev_put+0xa/0xc
> blkdev_close+0x2e/0x32
> __fput+0xcf/0x15f
> fput+0x19/0x1b
> filp_close+0x51/0x5b
> sys_close+0x73/0xad

hmmm, I wonder if DM/LVM doesn't like blkdev_issue_flush, or it's too 
early, or what.  I'll toss Ubuntu onto a VM and check it out...

	Jeff




* Re: Linux 2.6.29
  2009-03-27  1:03                       ` Linus Torvalds
  2009-03-27  1:25                         ` Andrew Morton
@ 2009-03-27  3:23                         ` Theodore Tso
  2009-03-27  3:47                           ` Matthew Garrett
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-27  3:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 06:03:15PM -0700, Linus Torvalds wrote:
> 
> Everybody accepts that if you've written a 20MB file and then call 
> "fsync()" on it, it's going to take a while. But when you've written a 2kB 
> file, and "fsync()" takes 20 seconds, because somebody else is just 
> writing normally, _that_ is a bug. And it is actually almost totally 
> unrelated to the whole 'dirty_limit' thing.

Yeah, well, it's caused by data=ordered, which is an ext3 unique
thing; no other filesystem (or operating system) has such a feature.
I'm beginning to wish we hadn't implemented it.  Yeah, it solved a
security problem (which delayed allocation also solves), but it
trained application programs to be careless about fsync(), and it's
caused us so many other problems, including the fsync() and unrelated
commit latency problems.

We are where we are, though, and people have been trained to think
they don't need fsync(), so we're going to have to deal with the
problem by having these implied fsync for cases like
replace-via-rename, and in addition to that, some kind of heuristic to
force out writes early to avoid these huge write latencies.  It would
be good to make it autotuning so that filesystems that don't do
ext3 data=ordered don't have to pay the price of having to force out
writes so aggressively early (since in some cases if the file
subsequently is deleted, we might be able to optimize out the write
altogether --- and that's good for SSD endurance).

	     	       		      	  	 - Ted

* Re: [PATCH v2] issue storage device flush via sync_blockdev()
  2009-03-27  3:17                                       ` Jeff Garzik
@ 2009-03-27  3:30                                         ` Theodore Tso
  0 siblings, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-27  3:30 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 11:17:44PM -0400, Jeff Garzik wrote:
> Theodore Tso wrote:
>> Jeff,
>>
>> FYI, I tried your patch; it causes the lvm process called out of
>> the initramfs from an Ubuntu 8.10 system to blow up while trying to
>> set up the root filesystem.  The stack trace was:
>>
>> generic_make_request+0x2a3/0x2e6
>> trace_hardirqs_on_caller+0x111/0x135
>> mempool_alloc_slab+0xe/0x10
>> mempool_alloc+0x42/0xe0
>> submit_bio+0xad/0xb5
>> bio_alloc_bioset+0x21/0xfc
>> blkdev_issue_flush+0x7f/0xfc
>> sync_blockdev+0x2a/0x36
>> __blkdev_put+0x44/0x131
>> blkdev_put+0xa/0xc
>> blkdev_close+0x2e/0x32
>> __fput+0xcf/0x15f
>> fput+0x19/0x1b
>> filp_close+0x51/0x5b
>> sys_close+0x73/0xad
>
> hmmm, I wonder if DM/LVM doesn't like blkdev_issue_flush, or it's too  
> early, or what.  I'll toss Ubuntu onto a VM and check it out...

I forgot to mention: the failure was that the EIP was NULL, so it looks
like we called a null function pointer.  The only function pointer
dereference I can find is q->make_request_fn() in line 1460 of
blk-core.c, in __generic_make_request.  But that doesn't seem to make any
sense....  

Anyway, maybe you can figure out what's going on.  The problem
disappeared as soon as I popped off this patch, though, so it was
pretty clearly the culprit.

					- Ted

* Re: Linux 2.6.29
  2009-03-27  2:21                           ` David Rees
  2009-03-27  3:03                             ` Matthew Garrett
@ 2009-03-27  3:36                             ` Dave Jones
  1 sibling, 0 replies; 664+ messages in thread
From: Dave Jones @ 2009-03-27  3:36 UTC (permalink / raw)
  To: David Rees
  Cc: Andrew Morton, Linus Torvalds, Theodore Tso, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 07:21:08PM -0700, David Rees wrote:
 
 > * Mon Mar 23 2009 xx <xx@xx.xx> 2.6.29-2
 >  - Change default swappiness setting from 60 to 30.
 > 
 > * Thu Mar 19 2009 xx <xx@xx.xx> 2.6.29-0.66.rc8.git4
 >  - Raise default vm dirty data limits from 5/10 to 10/20 percent.
 > 
 > Why are they going in the kernel package instead of /etc/sysctl.conf?

At least in part, because rpm sucks.
If a user has edited /etc/sysctl.conf, upgrading the initscripts package
won't change that file.

	Dave


* Re: Linux 2.6.29
  2009-03-27  1:25                         ` Andrew Morton
  2009-03-27  2:21                           ` David Rees
  2009-03-27  3:01                           ` Matthew Garrett
@ 2009-03-27  3:38                           ` Linus Torvalds
  2009-03-27  3:59                             ` Linus Torvalds
  2009-03-28  5:06                           ` Ingo Molnar
  2009-04-01 21:03                           ` Lennart Sorensen
  4 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27  3:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List



On Thu, 26 Mar 2009, Andrew Morton wrote:
> 
> Why does everyone just sit around waiting for the kernel to put a new
> value into two magic numbers which userspace scripts could have set?
> 
> My /etc/rc.local has been tweaking dirty_ratio, dirty_background_ratio
> and swappiness for many years.  I guess I'm just incredibly advanced.

.. and as a result you're also testing something that nobody else is.

Look at the complaints from people about fsync behavior that Ted says he 
cannot see. Let me guess: it's because Ted probably has tweaked his 
environment, because he is advanced. As a result, other people see 
problems, he does not.

That's not "advanced". That's totally f*cking broken.

Having different distributions tweak all those tweakables is just even 
_more_ so. It's the antithesis of "advanced". It's just stupid.

We should aim to get it right. The "user space can tweak any numbers they 
want" is ALWAYS THE WRONG ANSWER. It's a cop-out, but more importantly, 
it's a cop-out that doesn't even work, and that just results in everybody 
having different setups. Then nobody is happy.

		Linus


* Re: Linux 2.6.29
  2009-03-27  3:23                         ` Theodore Tso
@ 2009-03-27  3:47                           ` Matthew Garrett
  2009-03-27  5:13                             ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27  3:47 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 11:23:01PM -0400, Theodore Tso wrote:

> Yeah, well, it's caused by data=ordered, which is an ext3 unique
> thing; no other filesystem (or operating system) has such a feature.
> I'm beginning to wish we hadn't implemented it.  Yeah, it solved a
> security problem (which delayed allocation also solves), but it
> trained application programs to be careless about fsync(), and it's
> caused us so many other problems, including the fsync() and unrelated
> commit latency problems.

Oh, for the love of a whole range of mythological figures. ext3 didn't 
train application programmers that they could be careless about fsync(). 
It gave them functionality that they wanted, ie the ability to do things 
like rename a file over another one with the expectation that these 
operations would actually occur in the same order that they were 
generated. More to the point, it let them do this *without* having to 
call fsync(), resulting in a significant improvement in filesystem 
usability.

I'm utterly and screamingly bored of this "Blame userspace" attitude. 
The simple fact of the matter is that ext4 was designed without paying 
any attention to how the majority of applications behave. fsync() isn't 
the interface people want. ext3 demonstrated that a filesystem could be 
written that made life easier for application authors. Why on earth 
would anyone think that taking a step back by requiring fsync() in a 
wider range of circumstances was a good idea?
-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: Linux 2.6.29
  2009-03-27  3:38                           ` Linus Torvalds
@ 2009-03-27  3:59                             ` Linus Torvalds
  2009-03-28 23:52                               ` david
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27  3:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List



On Thu, 26 Mar 2009, Linus Torvalds wrote:
> 
> We should aim to get it right. The "user space can tweak any numbers they 
> want" is ALWAYS THE WRONG ANSWER. It's a cop-out, but more importantly, 
> it's a cop-out that doesn't even work, and that just results in everybody 
> having different setups. Then nobody is happy.

In fact it results in "everybody" just having the distro defaults, which 
in some cases then depend on things like which particular version they 
initially installed things with (because some decisions end up being 
codified in long-term memory by that initial install - like the size of 
the journal when you mkfs'd your filesystem, or the alignment of your 
partitions, or whatever).

The exception, of course, ends up being power-users that then tweak things 
on their own.

Me, I may be a power user, but I absolutely refuse to touch default 
values. If they are wrong, they should be fixed. I don't want to add 
"relatime" to my /etc/fstab, because then the next time I install, I'll 
forget - and if I really need to do that, then the kernel should have 
already done it for me as the default choice.

I also don't want to say that "Fedora should just do it right" (I'll 
complain about things Fedora does badly, but not setting magic values in 
/proc is not one of them), because then even if Fedora _were_ to get 
things right, others won't. Or even worse, somebody will point out that SuSE 
or Ubuntu _did_ do it right, but the distro I happen to use is doing the 
wrong thing.

And yes, I could do my own site-specific tweaks, but again, why should I?  
If the tweak really is needed, I should put it in the generic kernel. I 
don't do anything odd. 

End result: regardless of scenario, depending on user-land tweaking is 
always the wrong thing. It's the wrong thing for distributions (they'd all 
need to do the exact same thing anyway, or chaos reigns, so it might as 
well be a kernel default), and it's the wrong thing for individuals 
(because 99.9% of individuals won't know what to do, and the remaining 
0.1% should be trying to improve _other_ people's experiences, not just 
their own!).

The only excuse _ever_ for user-land tweaking is if you do something 
really odd. Say that you want to get the absolutely best OLTP numbers you 
can possibly get - with no regards for _any_ other workload. In that case, 
you want to tweak the numbers for that exact load, and the exact machine 
that runs it - and the result is going to be a totally worthless number 
(since it's just benchmarketing and doesn't actually reflect any real 
world scenario), but hey, that's what benchmarketing is all about.

Or say that you really are a very embedded environment, with a very 
specific load. A router, a cellphone, a base station, whatever - you do 
one thing, and you're not trying to be a general purpose machine. Then you 
can tweak for that load. But not otherwise.

If you don't have any magical odd special workloads, you shouldn't need to 
tweak a single kernel knob. Because if you need to, then the kernel is 
doing something wrong to begin with.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  3:47                           ` Matthew Garrett
@ 2009-03-27  5:13                             ` Theodore Tso
  2009-03-27  5:57                               ` Matthew Garrett
  2009-04-03 12:39                               ` Pavel Machek
  0 siblings, 2 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-27  5:13 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 03:47:05AM +0000, Matthew Garrett wrote:
> Oh, for the love of a whole range of mythological figures. ext3 didn't 
> train application programmers that they could be careless about fsync(). 
> It gave them functionality that they wanted, ie the ability to do things 
> like rename a file over another one with the expectation that these 
> operations would actually occur in the same order that they were 
> generated. More to the point, it let them do this *without* having to 
> call fsync(), resulting in a significant improvement in filesystem 
> usability.

Matthew, 

There were plenty of applications that were written for Unix *and*
Linux systems before ext3 existed, and they worked just fine.  Back
then, people were drilled into the fact that they needed to use
fsync(), and fsync() wasn't expensive, so it wasn't a big deal in
terms of usability.  The fact that fsync() became expensive was precisely
because of ext3's data=ordered problem.  Writing files safely meant
that you had to check error returns from fsync() *and* close().  

In fact, if you care about making sure that data doesn't get lost due
to disk errors, you *must* call fsync().  Pavel may have complained
that fsync() can sometimes drop errors if some other process also has
the file open and calls fsync() --- but if you don't, and you rely on
ext3 to magically write the data blocks out as a side effect of the
commit in data=ordered mode, there's no way to signal the write error
to the application, and you are *guaranteed* to lose the I/O error
indication.
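
[Illustration, not part of the original message.] A minimal C sketch of
the discipline described above: write the data, call fsync(), then
close(), and treat a failure from any of the three as a failed write.
The helper name write_file_checked() and the 0644 mode are assumptions
made for this example only.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): write len bytes to path and
 * return 0 only if write(), fsync() and close() all succeeded.
 * Returns -1 on any failure.
 */
static int write_file_checked(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int saved_errno;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			goto fail;
		}
		p += n;			/* short write: continue with the rest */
		len -= (size_t)n;
	}

	/* Ask the kernel to push the data out; an I/O error is reported here. */
	if (fsync(fd) < 0)
		goto fail;

	/* close() can also report a deferred write error, so check it too. */
	if (close(fd) < 0)
		return -1;
	return 0;

fail:
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return -1;
}
```

A caller would treat a -1 return as "the data may not be on disk" and
report or retry accordingly.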

I can tell you quite authoritatively that we didn't implement
data=ordered to make life easier for application writers, and
application writers didn't come to ext3 developers asking for this
convenience.  It may have **accidentally** given them convenience that
they wanted, but it also made fsync() slow.  

> I'm utterly and screamingly bored of this "Blame userspace" attitude. 

I'm not blaming userspace.  I'm blaming ourselves, for implementing an
attractive nuisance, and not realizing that we had implemented an
attractive nuisance; which years later, is also responsible for these
latency problems, both with and without fsync() ---- *and* which has
also trained people into believing that fsync() is always expensive,
and must be avoided at all costs --- which had not previously been
true!

If I had to do it all over again, I would have argued with Stephen
about making data=writeback the default, which would have provided
behaviour on crash just like ext2, except that we wouldn't have to
fsck the partition afterwards.  Back then, people lived with the
potential security exposure on a crash, and they lived with the fact
that you had to use fsync(), or manually type "sync", if you wanted to
guarantee that data would be safely written to disk.  And you know
what?  Things had been this way with Unix systems for 31 years before
ext3 came on the scene, and things worked pretty well during those
three decades.

So again, let me make it clear, I'm not "blaming userspace".  I'm
blaming ext3 data=ordered mode.  But it's trained application writers
to program systems a certain way, and it's trained them to assume that
fsync() is always evil, and they outnumber us kernel programmers, and
so we are where we are.  And data=ordered mode is also responsible for
these write latency problems which seem to make Ingo so cranky ---
and rightly so.  It all comes from the same source.

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  5:13                             ` Theodore Tso
@ 2009-03-27  5:57                               ` Matthew Garrett
  2009-03-27  6:21                                 ` Matthew Garrett
  2009-04-03 12:39                               ` Pavel Machek
  1 sibling, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27  5:57 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 01:13:39AM -0400, Theodore Tso wrote:

> There were plenty of applications that were written for Unix *and*
> Linux systems before ext3 existed, and they worked just fine.  Back
> then, people were drilled into the fact that they needed to use
> fsync(), and fsync() wasn't expensive, so it wasn't a big deal in
> terms of usability.  The fact that fsync() became expensive was precisely
> because of ext3's data=ordered problem.  Writing files safely meant
> that you had to check error returns from fsync() *and* close().  

And now life is better. UNIX's error handling has always meant that it's 
effectively impossible to ensure that data hits disk if you wander into 
a variety of error conditions, and by and large it's simply not worth 
worrying about them. You're generally more likely to hit a kernel bug or 
suffer hardware failure than find an error condition that can actually 
be handled in a sensible way, and the probability/effectiveness ratio is 
sufficiently low that there are better ways to spend your time unless 
you're writing absolutely mission critical code. So let's not focus on 
the risk of data loss from failing to check certain error conditions. 
It's a tiny risk compared to power loss.

> I can tell you quite authoritatively that we didn't implement
> data=ordered to make life easier for application writers, and
> application writers didn't come to ext3 developers asking for this
> convenience.  It may have **accidentally** given them convenience that
> they wanted, but it also made fsync() slow.  

It not only gave them that convenience, it *guaranteed* that 
convenience. And with ext3 being the standard filesystem in the Linux 
world, and every other POSIX system being by and large irrelevant[1], 
the real world effect of that was that Linux gave you that guarantee. 

> > I'm utterly and screamingly bored of this "Blame userspace" attitude. 
> 
> I'm not blaming userspace.  I'm blaming ourselves, for implementing an
> attractive nuisance, and not realizing that we had implemented an
> attractive nuisance; which years later, is also responsible for these
> latency problems, both with and without fsync() ---- *and* which has
> also trained people into believing that fsync() is always expensive,
> and must be avoided at all costs --- which had not previously been
> true!

But you're still arguing that applications should start using fsync(). 
I'm arguing that not only is this pointless (most of this code will 
never be "fixed") but it's also regressive. In most cases applications 
don't want the guarantees that fsync() makes, and given that we're going 
to have people running on ext3 for years to come they also don't want 
the performance hit that fsync() brings. Filesystems should just do the 
right thing, rather than losing people's data and then claiming that 
it's fine because POSIX said they could.

> If I had to do it all over again, I would have argued with Stephen
> about making data=writeback the default, which would have provided
> behaviour on crash just like ext2, except that we wouldn't have to
> fsck the partition afterwards.  Back then, people lived with the
> potential security exposure on a crash, and they lived with the fact
> that you had to use fsync(), or manually type "sync", if you wanted to
> guarantee that data would be safely written to disk.  And you know
> what?  Things had been this way with Unix systems for 31 years before
> ext3 came on the scene, and things worked pretty well during those
> three decades.

Well, no. fsync() didn't appear in early Unix, so what people were 
actually willing to live with was restoring from backups if the system 
crashed. I'd argue that things are somewhat better these days, 
especially now that we're used to filesystems that don't require us to 
fsync(), close(), fsync the directory and possibly jump through even 
more hoops if faced with a pathological interpretation of POSIX. 
Progress is a good thing. The initial behaviour of ext4 in this respect 
wasn't progress.
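
[Illustration, not part of the original message.] For reference, the
sequence being referred to (write a temporary file, fsync() it, close()
it, rename() it over the target, then fsync the directory) looks roughly
like this in C. The helper names and the ".tmp" suffix are assumptions
made for this sketch.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *p, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: replace dir/name with new contents via a temporary
 * file in the same directory. Returns 0 on success, -1 on failure.
 */
static int replace_file(const char *dir, const char *name,
			const void *buf, size_t len)
{
	char tmp[4096], dst[4096];
	int fd, dirfd, ret;

	if (snprintf(dst, sizeof(dst), "%s/%s", dir, name) >= (int)sizeof(dst) ||
	    snprintf(tmp, sizeof(tmp), "%s/%s.tmp", dir, name) >= (int)sizeof(tmp)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* Steps 1-3: write the data, fsync() it, close() the file. */
	ret = write_all(fd, buf, len);
	if (ret == 0)
		ret = fsync(fd);
	if (close(fd) < 0)
		ret = -1;
	if (ret < 0)
		goto fail;

	/* Step 4: atomically replace the old name with the new file. */
	if (rename(tmp, dst) < 0)
		goto fail;

	/* Step 5: fsync the directory so the rename itself is on disk. */
	dirfd = open(dir, O_RDONLY);
	if (dirfd < 0)
		return -1;
	ret = fsync(dirfd);
	if (close(dirfd) < 0)
		ret = -1;
	return ret;

fail:
	unlink(tmp);
	return -1;
}
```

With ext3 in data=ordered mode, applications commonly got away with
doing only the write, close() and rename() steps.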

And, really, I'm kind of amused at someone arguing for a given behaviour 
on the basis of POSIX while also suggesting that sync() is in any way 
helpful for guaranteeing that data is on disk.

> So again, let me make it clear, I'm not "blaming userspace".  I'm
> blaming ext3 data=ordered mode.  But it's trained application writers
> to program systems a certain way, and it's trained them to assume that
> fsync() is always evil, and they outnumber us kernel programmers, and
> so we are where we are.  And data=ordered mode is also responsible for
> these write latency problems which seem to make Ingo so cranky ---
> and rightly so.  It all comes from the same source.

No. People continue to use fsync() where fsync() should be used - for 
guaranteeing that given information has hit disk. The problem is that 
you're arguing that applications should use fsync() even when they don't
want or need that guarantee. If anything, ext3 has been helpful in 
encouraging people to only use fsync() when they really need to - and 
that's a win for everyone.

[1] MacOS has users, but it's not a significant market for pure POSIX 
applications so isn't really an interesting counterexample
-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  5:57                               ` Matthew Garrett
@ 2009-03-27  6:21                                 ` Matthew Garrett
  2009-03-27 11:24                                   ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27  6:21 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 05:57:50AM +0000, Matthew Garrett wrote:

> Well, no. fsync() didn't appear in early Unix, so what people were 
> actually willing to live with was restoring from backups if the system 
> crashed. I'd argue that things are somewhat better these days, 
> especially now that we're used to filesystems that don't require us to 
> fsync(), close(), fsync the directory and possibly jump through even 
> more hoops if faced with a pathological interpretation of POSIX. 
> Progress is a good thing. The initial behaviour of ext4 in this respect 
> wasn't progress.

And, hey, fsync didn't make POSIX proper until 1996. It's not like 
authors were able to depend on it for a significant period of time 
before ext3 hit the scene.

(It could be argued that most relevant Unices implemented fsync() even 
before then, so its status in POSIX was broadly irrelevant. The obvious 
counterargument is that most relevant Unix filesystems ensure that data 
is written before a clobbering rename() is carried out, so POSIX is 
again not especially relevant)
-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:25                             ` Jeff Garzik
  2009-03-25 20:40                               ` Linus Torvalds
@ 2009-03-27  7:46                               ` Jens Axboe
  1 sibling, 0 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-27  7:46 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List


Jeff, if you drop my CC on reply, I won't see your messages for ages.

On Wed, Mar 25 2009, Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Wed, Mar 25 2009, Jeff Garzik wrote:
>>> Stating "fsync already does that" borders on false, because that assumes
>>> (a) the user has a fs that supports barriers
>>> (b) the user is actually aware of a 'barriers' mount option and what 
>>> it  means
>>> (c) the user has turned on an option normally defaulted to off.
>>>
>>> Or in other words, it pretty much never happens.
>>
>> That is true, except if you use xfs/ext4. And this discussion is fine,
>> as was the one a few months back that got ext4 to enable barriers by
>> default. If I had submitted patches to do that back in 2001/2 when the
>> barrier stuff was written, I would have been shot for introducing such a
>> slow down. After people found out that it just wasn't something silly,
>> then you have a way to enable it.
>>
>> I'd still wager that most people would rather have a 'good enough
>> fsync' on their desktops than incur the penalty of barriers or write
>> through caching. I know I do.
>
> That's a strawman argument:  The choice is not between "good enough  
> fsync" and full use of barriers / write-through caching, at all.

Then let me rephrase that to "most users don't care about full integrity
fsync()". If it kills their firefox performance, most will want to turn
it off. Personally I'd never use it on my notebook or desktop box,
simply because I don't care strongly enough. I'd rather fix things up in
the very unlikely event of a crash WITH corruption.

> It is clearly possible to implement an fsync(2) that causes FLUSH CACHE  
> to be issued, without adding full barrier support to a filesystem.  It  
> is likely doable to avoid touching per-filesystem code at all, if we  
> issue the flush from a generic fsync(2) code path in the kernel.

Of course, it would be trivial. Just add a blkdev_issue_flush() to
vfs_fsync().

> Thus, you have a "third way":  fsync(2) gives the guarantee it is  
> supposed to, but you do not take the full performance hit of  
> barriers-all-the-time.
>
> Remember, fsync(2) means that the user _expects_ a performance hit.
>
> And they took the extra step to call fsync(2) because they want a  
> guarantee, not a lie.

s/user/application.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 20:40                               ` Linus Torvalds
                                                   ` (2 preceding siblings ...)
  2009-03-25 21:33                                 ` Linux 2.6.29 Jeff Garzik
@ 2009-03-27  7:57                                 ` Jens Axboe
  2009-03-27 14:13                                   ` Theodore Tso
  2009-03-27 19:14                                   ` Chris Mason
  3 siblings, 2 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-27  7:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25 2009, Linus Torvalds wrote:
> 
> 
> On Wed, 25 Mar 2009, Jeff Garzik wrote:
> > 
> > It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
> > issued, without adding full barrier support to a filesystem.  It is likely
> > doable to avoid touching per-filesystem code at all, if we issue the flush
> > from a generic fsync(2) code path in the kernel.
> 
> We could easily do that. It would even work for most cases. The 
> problematic ones are where filesystems do their own disk management, but I 
> guess those people can do their own fsync() management too.
> 
> Somebody send me the patch, we can try it out.

Here's a simple patch that does that. Not even tested, but it compiles. Note
that file systems that currently do blkdev_issue_flush() in their
->sync() should then get it removed.

> > Remember, fsync(2) means that the user _expects_ a performance hit.
> 
> Within reason, though.
> 
> OS X, for example, doesn't do the disk barrier. It requires you to do a 
> separate FULL_FSYNC (or something similar) ioctl to get that. Apparently 
> exactly because users don't expect quite _that_ big of a performance hit.
> 
> (Or maybe just because it was easier to do that way. Never attribute to 
> malice what can be sufficiently explained by stupidity).

It'd be better to have a knob to control whether fsync() should care
about the hardware side as well, instead of trying to teach applications
to use FULL_FSYNC.
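
[Illustration, not part of the original message.] For comparison, this is
roughly what "teaching applications" looks like from userspace. The OS X
operation mentioned above is the fcntl() command F_FULLFSYNC; the wrapper
name sync_to_media() is an assumption made for this sketch, and on systems
that do not define F_FULLFSYNC it is simply a plain fsync().

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: request the strongest flush the platform offers.
 * Returns 0 on success, -1 on failure.
 */
static int sync_to_media(int fd)
{
#ifdef F_FULLFSYNC
	/* Mac OS X: also ask the drive to flush its own write cache. */
	if (fcntl(fd, F_FULLFSYNC) == 0)
		return 0;
	/* Not supported on this file system: fall back to plain fsync(). */
#endif
	return fsync(fd);
}
```

Every application that cares has to carry a wrapper like this, which is
the argument for a kernel-side knob instead.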

diff --git a/fs/sync.c b/fs/sync.c
index ec95a69..7a44d4e 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -8,6 +8,7 @@
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/writeback.h>
+#include <linux/blkdev.h>
 #include <linux/syscalls.h>
 #include <linux/linkage.h>
 #include <linux/pagemap.h>
@@ -104,6 +105,7 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 {
 	const struct file_operations *fop;
 	struct address_space *mapping;
+	struct block_device *bdev;
 	int err, ret;
 
 	/*
@@ -138,6 +140,13 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 	err = filemap_fdatawait(mapping);
 	if (!ret)
 		ret = err;
+
+	bdev = mapping->host->i_sb->s_bdev;
+	if (bdev) {
+		err = blkdev_issue_flush(bdev, NULL);
+		if (!ret)
+			ret = err;
+	}
 out:
 	return ret;
 }

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage device flush via sync_blockdev() (was Re:  Linux 2.6.29)
  2009-03-25 21:56                                   ` Eric Sandeen
  2009-03-25 23:08                                     ` Jeff Garzik
  2009-03-26  0:58                                     ` Ric Wheeler
@ 2009-03-27  7:59                                     ` Jens Axboe
  2 siblings, 0 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-27  7:59 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, Mar 25 2009, Eric Sandeen wrote:
> Jeff Garzik wrote:
> > On Wed, Mar 25, 2009 at 01:40:37PM -0700, Linus Torvalds wrote:
> >> On Wed, 25 Mar 2009, Jeff Garzik wrote:
> >>> It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
> >>> issued, without adding full barrier support to a filesystem.  It is likely
> >>> doable to avoid touching per-filesystem code at all, if we issue the flush
> >>> from a generic fsync(2) code path in the kernel.
> >> We could easily do that. It would even work for most cases. The 
> >> problematic ones are where filesystems do their own disk management, but I 
> >> guess those people can do their own fsync() management too.
> >>
> >> Somebody send me the patch, we can try it out.
> > 
> > This is a simple step that would cover a lot of cases...  sync(2)
> > calls sync_blockdev(), and many filesystems do as well via the generic
> > filesystem helper file_fsync (fs/sync.c).
> > 
> > XFS code calls sync_blockdev() a "big hammer", so I hope my patch
> > follows with known practice.
> > 
> > Looking over every use of sync_blockdev(), its most frequent use is
> > through fsync(2), for the selected filesystems that use the generic
> > file_fsync helper.
> > 
> > Most callers of sync_blockdev() in the kernel do so infrequently,
> > when removing and invalidating volumes (MD) or storing the superblock
> > prior to release (put_super) in some filesystems.
> > 
> > Compile-tested only, of course :)  But it should work :)
> > 
> > My main concern is some hidden area that calls sync_blockdev() with
> > a high-enough frequency that the performance hit is bad.
> > 
> > Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 891e1c7..7b9f74a 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -173,9 +173,14 @@ int sync_blockdev(struct block_device *bdev)
> >  {
> >  	int ret = 0;
> >  
> > -	if (bdev)
> > -		ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
> > -	return ret;
> > +	if (!bdev)
> > +		return 0;
> > +	
> > +	ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
> > +	if (ret)
> > +		return ret;
> > +	
> > +	return blkdev_issue_flush(bdev, NULL);
> >  }
> >  EXPORT_SYMBOL(sync_blockdev);
> 
> What about when you're running over a big raid device with
> battery-backed cache, and you trust the cache as much as the
> disks.  Wouldn't this unconditional cache flush be painful there on any
> of the callers even if they're rare?  (fs unmounts, freezes, unmounts,
> etc?  Or a fat filesystem on that device doing an fsync?)
> 
> xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
> not enabled, I think for that reason...
> 
> (I'm assuming these raid devices still honor a cache flush request even
> if they're battery-backed?  I dunno.)

I think most don't, as they realize it's a data integrity thing and that
doesn't apply if you don't lose data on power loss. But, I'm sure there
are also "dumb" ones that DO flush the cache. In which case the flush is
utterly hopeless and should not be done.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  0:11                 ` Andrew Morton
  2009-03-27  0:27                   ` Linus Torvalds
@ 2009-03-27  9:58                   ` Alan Cox
  1 sibling, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-27  9:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Theodore Tso, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

> userspace can do it quite easily.  Run a self-tuning script after
> installation and when the disk hardware changes significantly.

Which is "all the time" in some configurations. It really needs to be
self tuning internally based on the observed achieved rates (just as you
don't use a script to tune your network bandwidth each day)

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 23:04                                                   ` Bron Gondwana
@ 2009-03-27 11:22                                                     ` Alan Cox
  2009-03-27 12:19                                                       ` Bron Gondwana
  0 siblings, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-27 11:22 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Matthew Garrett, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Jan Kara, Andrew Morton, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, Oleg Nesterov, Roland McGrath

> Is this the same Alan Cox who thought a couple of months ago that
> having an insanely low default maximum number epoll instances was a
> reasonable answer to a theoretical DoS risk, despite it breaking
> pretty much every reasonable user of the epoll interface?

In the short term yes - because security has to be a very high priority.
Lesser of two evils.

Alan

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  6:21                                 ` Matthew Garrett
@ 2009-03-27 11:24                                   ` Theodore Tso
  2009-03-27 12:17                                     ` Linux 2.6.29 - delayed metadata for delayed allocation? Andreas T.Auer
                                                       ` (2 more replies)
  0 siblings, 3 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-27 11:24 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 06:21:14AM +0000, Matthew Garrett wrote:
> And, hey, fsync didn't make POSIX proper until 1996. It's not like 
> authors were able to depend on it for a significant period of time 
> before ext3 hit the scene.

Fsync() was in BSD 4.3 and it was in much earlier Unix specifications,
such as SVID, well before it appeared in POSIX.  If an interface was
in both BSD and AT&T System V Unix, it was around everywhere.

> (It could be argued that most relevant Unices implemented fsync() even 
> before then, so its status in POSIX was broadly irrelevant. The obvious 
> counterargument is that most relevant Unix filesystems ensure that data 
> is written before a clobbering rename() is carried out, so POSIX is 
> again not especially relevant)

Nope, not true.  Most relevant Unix file systems sync'ed data blocks
on a 30 second timer, and metadata on 5 second timers.  They did *not* force
data to be written before a clobbering rename() was carried out;
you're rewriting history when you say that; it's simply not true.
Rename was atomic *only* where metadata was concerned, and all the
talk about rename being atomic was because back then we didn't have
flock() and you built locking primitives out of open(O_CREAT) and rename();
but that was only metadata, and that was only if the system didn't
crash.
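
[Illustration, not part of the original message.] A minimal C sketch of
that kind of metadata-only locking: a lock file whose creation is the
lock. The message mentions open(O_CREAT); adding O_EXCL so that exactly
one creator wins is an assumption of this sketch, as are the helper
names. Nothing here touches file data, and the lock does not survive or
clean up after a crash.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to take the lock by creating the lock file.
 * Returns 0 if acquired, 1 if somebody else holds it, -1 on other errors.
 */
static int lockfile_acquire(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

	if (fd < 0)
		return errno == EEXIST ? 1 : -1;
	close(fd);
	return 0;
}

/* Release the lock by removing the lock file. */
static int lockfile_release(const char *path)
{
	return unlink(path);
}
```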

When I was growing up we were trained to *always* check error returns
from *all* system calls, and to *always* fsync() if it was critical
that the data survive a crash.  That was what competent Unix
programmers did.  And if you are always checking error returns, the
difference in the Lines of Code between doing it right and doing it
wrong really wasn't that big --- and again, back then fsync() wasn't
expensive.  Making fsync expensive was ext3's data=ordered mode's
fault.

Then again, most users or system administrators of Unix systems didn't
tolerate device drivers that would crash your system when you exited a
game, either....  and I've said that I recognize the world has changed
and that crappy application programmers outnumber kernel programmers,
which is why I coded the workaround for ext4.  That still doesn't make
what they are doing correct.

     	   	      	     	      	     - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: relatime: update once per day patches (was: ext3 IO latency  measurements)
  2009-03-26 19:20                                                     ` Andrew Morton
  2009-03-26 19:43                                                       ` Matthew Garrett
@ 2009-03-27 11:25                                                       ` David Hagood
  1 sibling, 0 replies; 664+ messages in thread
From: David Hagood @ 2009-03-27 11:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Garrett, Frans Pop, Linus Torvalds, mingo, tytso, jack,
	alan, arjan, a.p.zijlstra, npiggin, jens.axboe, drees76, jesper,
	linux-kernel, oleg, roland

It seems to me that, rather than having the kernel maintain a timer (or
multiple timers, one per mount) itself, it would make sense to have
entries in /sys which, when written to, cause the file system layer to
flush all atime data to the mounted volume.

Something like
/sys
/sys/atime
/sys/atime/all
/sys/atime/<mountpoint id>/flush

where <mountpoint id> would be the name of the file system
(e.g. /sys/atime/usr/flush).

The only sticky part would be how to describe "/" in such a system.

(Better still would be a /sys/ system for each file system with the
various parameters (e.g. uid, journal) as entries + an entry for
flushing atime, but that is beyond the scope of this discussion.)

That would truly let userspace set policy, while the kernel provides
mechanism. Thus, a script that depends upon atime being accurate could
simply tickle the sysfs entries as needed before running.



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-24 23:03               ` Jesse Barnes
  2009-03-25  0:05                 ` Arjan van de Ven
  2009-03-25  2:09                 ` Theodore Tso
@ 2009-03-27 11:27                 ` Martin Steigerwald
  2 siblings, 0 replies; 664+ messages in thread
From: Martin Steigerwald @ 2009-03-27 11:27 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

On Wednesday 25 March 2009, Jesse Barnes wrote:
> On Tue, 24 Mar 2009 09:20:32 -0400
>
> Theodore Tso <tytso@mit.edu> wrote:
> > They don't solve the problem where there is a *huge* amount of writes
> > going on, though --- if something is dirtying pages at a rate far
> > greater than the local disk can write it out, say, either "dd
> > if=/dev/zero of=/mnt/make-lots-of-writes" or a massive distcc cluster
> > driving a huge amount of data towards a single system or a wget over
> > a local 100 megabit ethernet from a massive NFS server where
> > everything is in cache, then you can have a major delay with the
> > fsync().
>
> You make it sound like this is hard to do...  I was running into this
> problem *every day* until I moved to XFS recently.  I'm running a
> fairly beefy desktop (VMware running a crappy Windows install w/AV junk
> on it, builds, icecream and large mailboxes) and have a lot of RAM, but
> it became unusable for minutes at a time, which was just totally
> unacceptable, thus the switch.  Things have been better since, but are
> still a little choppy.
>
> I remember early in the 2.6.x days there was a lot of focus on making
> interactive performance good, and for a long time it was.  But this I/O
> problem has been around for a *long* time now... What happened?  Do not
> many people run into this daily?  Do all the filesystem hackers run
> with special mount options to mitigate the problem?

Well, I always had the feeling that at some point from one 2.6.x release to
another, I/O latencies increased a lot. At first I thought I was just
imagining it, and by the time I became convinced it was real, I had
forgotten since when I had been observing these increased latencies.

This is on IBM ThinkPad T42 and T23 with XFS.

I/O latencies are pathetic when dpkg reads in the database or I do tar -xf 
linux-x.y.z.tar.bz2.

I never got to the bottom of what is causing these higher latencies, even
though I tried different I/O schedulers, tuned XFS options and used relatime.

What I did find, though, is that on XFS at least, a tar -xf linux-kernel /
rm -rf linux-kernel operation is way slower with barriers and write cache
enabled than with no barriers and no write cache enabled. And frankly I
never got that.

XFS crawls to a stop on metadata operations when barriers are enabled.
According to the XFS FAQ, disabling the drive write cache should be as safe
as enabling barriers. And I always understood barriers as a feature to
have *some* ordering constraints, i.e. writes before the barrier go before
the barrier and writes after it go after it - even when a drive's hardware
write cache is involved. But when this cache is disabled, ordering will
always be as issued from the Linux block layer, because all I/Os issued to
the drive are write-through and synchronous without write cache, whereas
only barrier requests are synchronous with barriers and write cache.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements
  2009-03-26 18:59                                             ` ext3 IO latency measurements (was: Linux 2.6.29) Alan Cox
  2009-03-26 20:02                                               ` Matthew Garrett
@ 2009-03-27 12:00                                               ` Giacomo A. Catenazzi
  1 sibling, 0 replies; 664+ messages in thread
From: Giacomo A. Catenazzi @ 2009-03-27 12:00 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Alan Cox wrote:
>> And what's the argument for not doing it in the kernel?
>>
>> The fact is, "atime" by default is just wrong.
> 
> It probably was a wrong default - twenty years ago. Actually it may well
> have been a wrong default in Unix v6 8)
> 
> However
> - atime behaviour is SuS required

So I propose another mount option alongside strictatime:
nowatime, which reports the current time as atime.
It is totally useless, but fast *and* POSIX compatible:
- no disk writes on accesses
- POSIX doesn't mandate the behaviour of other processes, so
   we simulate that filesystems are scanned at every fs-tick.
- IMHO more programs break, but in this case only
   the POSIX-incompatible programs.


> - there are users with systems out there using atime and dependant on
>   proper atime

This is the real problem.

ciao
	cate


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29 - delayed metadata for delayed allocation?
  2009-03-27 11:24                                   ` Theodore Tso
@ 2009-03-27 12:17                                     ` Andreas T.Auer
  2009-03-27 14:51                                     ` Linux 2.6.29 Matthew Garrett
  2009-03-27 21:11                                     ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 664+ messages in thread
From: Andreas T.Auer @ 2009-03-27 12:17 UTC (permalink / raw)
  To: Theodore Tso, Linux Kernel Mailing List
  Cc: Matthew Garrett, Linus Torvalds, Andrew Morton, David Rees, Jesper Krogh

On 2009-03-27 at 12:24 Theodore Tso wrote:
> When I was growing up we were trained to *always* check error returns
> from *all* system calls, and to *always* fsync() if it was critical
> that the data survive a crash.
But there are a lot of applications for which the survival of the data
is not that critical, as long as the old data is still available.

Data is the important stuff; metadata helps to find it, even though
there are a lot of cases where the information is stored only in the
metadata.
If you write metadata for not-yet-existing data to disk, then the metadata
is inconsistent, corrupt, dirty.

Why don't you just delay the writing of these dirty metadata, too, until
they are clean? So nothing is written until the next sync and then

1) write the data to the nicely allocated places.
2) journal the metadata for consistency
3) write the metadata
4) cleanup the journal

That way you can have sophisticated allocation and keep a consistent
filesystem without data loss due to re-ordering.

Clean metadata-changes which don't have delayed data might be
written/journaled immediately. That raises the question of whether dirty
metadata changes should be skipped or whether a dirty metadata change
should block later clean metadata changes to inhibit the re-ordering of
changes. This should be a mount-option IMHO. Keeping the order of
fs-changes has a big advantage in many cases.

Syncing data on renames would decrease your performance which you want
to increase with delayed allocation. Delayed metadata would mostly keep
this performance gain, right?

Andreas






^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-27 11:22                                                     ` Alan Cox
@ 2009-03-27 12:19                                                       ` Bron Gondwana
  2009-03-27 13:56                                                         ` Alan Cox
  0 siblings, 1 reply; 664+ messages in thread
From: Bron Gondwana @ 2009-03-27 12:19 UTC (permalink / raw)
  To: Alan Cox
  Cc: Bron Gondwana, Matthew Garrett, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Jan Kara, Andrew Morton, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Fri, Mar 27, 2009 at 11:22:48AM +0000, Alan Cox wrote:
> > Is this the same Alan Cox who thought a couple of months ago that
> > having an insanely low default maximum number of epoll instances was a
> > reasonable answer to a theoretical DoS risk, despite it breaking
> > pretty much every reasonable user of the epoll interface?
> 
> In the short term yes - because security has to be a very high priority.
> Lesser of two evils.

So turn the machine off.

It seems to me that having atime turned on is a DoS risk.  Any punk
can cause lots of disk IO that will make everyone else's fsync's
turn into molasses simply by reading lots of files.  ZOMG (as the
kiddies of today would say) - we'd better fix this DoS risk by
disabling or rate limiting this dangerous vector (eleventyone!)

Bron ( ok, I'm getting a bit silly here - but if we blocked every
       potential DoS by making sure a single user could only use a
       small percentage of the machine's total capacity at maximum... )

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-23 23:29 Linux 2.6.29 Linus Torvalds
  2009-03-24  6:19 ` Jesper Krogh
  2009-03-24 13:02 ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Ingo Molnar
@ 2009-03-27 13:35 ` Hans-Peter Jansen
  2009-03-27 14:53   ` Geert Uytterhoeven
  2009-03-27 16:49   ` Frans Pop
  2 siblings, 2 replies; 664+ messages in thread
From: Hans-Peter Jansen @ 2009-03-27 13:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds

On Tuesday, 24 March 2009, Linus Torvalds wrote:
>
> This obviously starts the merge window for 2.6.30, although as usual,
> I'll probably wait a day or two before I start actively merging.

It would be very nice if you could start with a commit to the Makefile that 
reflects the new series, e.g.:

VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 30
EXTRAVERSION = -pre

-pre for the preparation state.

Thanks,
Pete

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 22:24                                               ` Linus Torvalds
@ 2009-03-27 13:47                                                 ` Bill Nottingham
  0 siblings, 0 replies; 664+ messages in thread
From: Bill Nottingham @ 2009-03-27 13:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

Linus Torvalds (torvalds@linux-foundation.org) said: 
> > Well, it's got to find the root fs options somewhere. Pulling them
> > from the modified /etc/fstab in the root fs before you mount it, well...
> 
> Umm.
> 
> The _only_ sane thing to do is to mount the root read-only from initramfs, 
> and then re-mount it with the options in the /etc/fstab later when you 
> re-mount it read-write _anyway_ (which may possibly be immediately, of 
> course).

Sure, and as said, as soon as you try to specify journal options (and
possibly others), this immediately fails. You can apply the options
one at a time, and decide some aren't fatal, or you can actually have
your later remount have code to drop specific options, requiring
implementation knowledge of any filesystem to be used. Or you say
people who specify journal options in fstab don't get to boot.

But if you blindly attempt to apply fstab options later in the remount,
some options will break.

Bill

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-27 12:19                                                       ` Bron Gondwana
@ 2009-03-27 13:56                                                         ` Alan Cox
  0 siblings, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-27 13:56 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Bron Gondwana, Matthew Garrett, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Jan Kara, Andrew Morton, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

> Bron ( ok, I'm getting a bit silly here

Yes you are - completely.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/2] Make relatime default
  2009-03-27  0:30                                                       ` Linus Torvalds
@ 2009-03-27 14:06                                                         ` Alan Cox
  0 siblings, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-27 14:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frans Pop, mjg, tytso, mingo, jack, akpm, arjan, a.p.zijlstra,
	npiggin, jens.axboe, drees76, jesper, linux-kernel, oleg, roland

> Why? RELATIME has been around since 2006 now.

The workable fixes to relatime (the always update once per 24 hours) you
only just committed - and they did come from a vendor.

It also looks, btw, like we don't want to have a "relatime" option and a
"strictatime" option and a "relatimebutdoitevery24hrs" option.

All three of these are the same thing so it should (regardless of default
choice) be 

	relatime=n

	n = 0 ('update if it's more than 0 seconds out of date') =
	strictatime
	n = MAXINT (basically equals relatime)
	n = 24hrs (the new 'fixed' relatime but not too relative)

	n = anything else - user tuned
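
As a rough sketch of that single knob (not kernel code; the function
name atime_needs_update and the constants are made up for illustration,
and keeping relatime's "atime not newer than mtime/ctime" check is an
assumption on top of the proposal above):

```python
DAY = 24 * 60 * 60      # seconds in 24 hours
MAXINT = 2**31 - 1      # stand-in for "never update by age alone"


def atime_needs_update(atime, mtime, ctime, now, n):
    """Return True if an access at time `now` should write a new atime.

    All times are in seconds; `n` is the proposed relatime=n threshold.
    """
    # Assumption: keep relatime's existing rule that a file modified or
    # changed since its last recorded access always gets a fresh atime.
    if atime <= mtime or atime <= ctime:
        return True
    # The proposed rule: update once the stored atime is more than
    # n seconds out of date.
    return now - atime > n


if __name__ == "__main__":
    # File last read at t=100, last written at t=50.
    for n in (0, DAY, MAXINT):
        print(n, atime_needs_update(100, 50, 50, 100 + DAY + 1, n))
```

With n = 0 every later access updates atime, with n = DAY it is
refreshed at most once per 24 hours, and with n = MAXINT only the
mtime/ctime comparison ever triggers an update.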

Alan

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  7:57                                 ` Jens Axboe
@ 2009-03-27 14:13                                   ` Theodore Tso
  2009-03-27 14:35                                     ` Christoph Hellwig
  2009-03-27 19:14                                   ` Chris Mason
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-27 14:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Jeff Garzik, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 08:57:23AM +0100, Jens Axboe wrote:
> 
> Here's a simple patch that does that. Not even tested, it compiles. Note
> that file systems that currently do blkdev_issue_flush() in their
> ->sync() should then get it removed.
> 

That's going to be a mess.  Ext3 implements an fsync() by requesting a
journal commit, and then waiting for the commit to have taken place.
The commit happens in another thread, kjournald.  Knowing when it's OK
not to do a blkdev_issue_flush() because the commit was triggered by
an fsync() is going to be really messy.  Could we at least have a flag
in struct super which says, "We'll handle the flush correctly, please
don't try to do it for us?"

						- Ted


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 14:13                                   ` Theodore Tso
@ 2009-03-27 14:35                                     ` Christoph Hellwig
  2009-03-27 15:03                                       ` Ric Wheeler
  2009-03-27 20:38                                       ` Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: Christoph Hellwig @ 2009-03-27 14:35 UTC (permalink / raw)
  To: Theodore Tso, Jens Axboe, Linus Torvalds, Jeff Garzik,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 10:13:33AM -0400, Theodore Tso wrote:
> On Fri, Mar 27, 2009 at 08:57:23AM +0100, Jens Axboe wrote:
> > 
> > Here's a simple patch that does that. Not even tested, it compiles. Note
> > that file systems that currently do blkdev_issue_flush() in their
> > ->sync() should then get it removed.
> > 
> 
> That's going to be a mess.  Ext3 implements an fsync() by requesting a
> journal commit, and then waiting for the commit to have taken place.
> The commit happens in another thread, kjournald.  Knowing when it's OK
> not to do a blkdev_issue_flush() because the commit was triggered by
> an fsync() is going to be really messy.  Could we at least have a flag
> in struct super which says, "We'll handle the flush correctly, please
> don't try to do it for us?"

Doing it in vfs_fsync also is completely wrong layering.  If people want
it for simple filesystems add it to file_fsync instead of messing up
the generic helper.  Removing well meaning but ill behaved policy from
the generic path has been costing me far too much time lately.

And please add a tunable for the flush.  Preferably a generic one at
the block device layer instead of the current mess where every
filesystem has a slightly different option for barrier usage.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 11:24                                   ` Theodore Tso
  2009-03-27 12:17                                     ` Linux 2.6.29 - delayed metadata for delayed allocation? Andreas T.Auer
@ 2009-03-27 14:51                                     ` Matthew Garrett
  2009-03-27 15:08                                       ` Alan Cox
  2009-03-27 15:20                                       ` Giacomo A. Catenazzi
  2009-03-27 21:11                                     ` Jeremy Fitzhardinge
  2 siblings, 2 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27 14:51 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 07:24:38AM -0400, Theodore Tso wrote:
> On Fri, Mar 27, 2009 at 06:21:14AM +0000, Matthew Garrett wrote:
> > And, hey, fsync didn't make POSIX proper until 1996. It's not like 
> > authors were able to depend on it for a significant period of time 
> > before ext3 hit the scene.
> 
> Fsync() was in BSD 4.3 and it was in much earlier Unix specifications,
> such as SVID, well before it appeared in POSIX.  If an interface was
> in both BSD and AT&T System V Unix, it was around everywhere.

And if a behaviour is in ext3, then for the vast majority of practical 
purposes it exists everywhere. Users of non-Linux POSIX operating systems 
are niche. Users of non-ext3 filesystems on Linux are niche.

> > (It could be argued that most relevant Unices implemented fsync() even 
> > before then, so its status in POSIX was broadly irrelevant. The obvious 
> > counterargument is that most relevant Unix filesystems ensure that data 
> > is written before a clobbering rename() is carried out, so POSIX is 
> > again not especially relevant)
> 
> Nope, not true.  Most relevant Unix file systems sync'ed data blocks
> on a 30 second timer, and metadata on 5 second timers.  They did *not* force
> data to be written before a clobbering rename() was carried out;
> you're rewriting history when you say that; it's simply not true.
> Rename was atomic *only* where metadata was concerned, and all the
> talk about rename being atomic was because back then we didn't have
> flock() and you built locking primitives with open(O_CREAT) and rename();
> but that was only metadata, and that was only if the system didn't
> crash.

No, you're missing my point. The other Unix file systems are irrelevant. 
The number of people running them and having any real risk of system 
crash is small, and they're the ones with full system backups anyway.

> When I was growing up we were trained to *always* check error returns
> from *all* system calls, and to *always* fsync() if it was critical
> that the data survive a crash.  That was what competent Unix
> programmers did.  And if you are always checking error returns, the
> difference in the Lines of Code between doing it right and doing
> it wrong really wasn't that big --- and again, back then fsync() wasn't
> expensive.  Making fsync expensive was ext3's data=ordered mode's
> fault.
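
For reference, the "check every return and fsync() before the rename"
discipline being argued about looks roughly like this. It is a minimal
sketch, not taken from any application in this thread: the helper names
and the ".tmp" suffix are made up, and in Python a failed system call
surfaces as an OSError exception rather than a return code to test.

```python
import os


def write_all(fd, data):
    """Write every byte of `data`, looping over short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)   # raises OSError on failure
        view = view[written:]


def replace_file_durably(path, data):
    """Replace `path` with `data` so a crash leaves either old or new contents."""
    tmp = path + ".tmp"                # hypothetical temp-file naming
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        write_all(fd, data)
        os.fsync(fd)                   # data must be on disk before the rename
    except OSError:
        os.close(fd)
        os.unlink(tmp)                 # don't leave a half-written temp file
        raise
    os.close(fd)
    os.rename(tmp, path)               # atomically clobber the old file
    # Also sync the directory so the rename itself survives a crash.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

The fsync() before the rename is the step whose cost on ext3 with
data=ordered is at issue here; leaving it out is what exposes the
zero-length-file window with delayed allocation.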

When my grandmother was growing up she had to use an outside toilet. 
Sometimes the past sucked and we're glad of progress being made.

> Then again, most users or system administrators of Unix systems didn't
> tolerate device drivers that would crash your system when you exited a
> game, either....  and I've said that I recognize the world has changed
> and that crappy application programmers outnumber kernel programmers,
> which is why I coded the workaround for ext4.  That still doesn't make
> what they are doing correct.

No, look, you're blaming userspace again. Stop it.
-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 13:35 ` Linux 2.6.29 Hans-Peter Jansen
@ 2009-03-27 14:53   ` Geert Uytterhoeven
  2009-03-27 15:46     ` Mike Galbraith
  2009-03-27 16:49   ` Frans Pop
  1 sibling, 1 reply; 664+ messages in thread
From: Geert Uytterhoeven @ 2009-03-27 14:53 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, Linus Torvalds

On Fri, 27 Mar 2009, Hans-Peter Jansen wrote:
> On Tuesday, 24 March 2009, Linus Torvalds wrote:
> > This obviously starts the merge window for 2.6.30, although as usual,
> > I'll probably wait a day or two before I start actively merging.
> 
> It would be very nice, if you could start with a commit to Makefile, that 
> reflects the new series: e.g.:
> 
> VERSION = 2
> PATCHLEVEL = 6
> SUBLEVEL = 30
> EXTRAVERSION = -pre
> 
> -pre for preparing state.

If you're using the kernel-of-the-day, you're probably using git, and
CONFIG_LOCALVERSION_AUTO=y should be mandatory.

My kernel is called 2.6.29-03321-gbe0ea69...

With kind regards,

Geert Uytterhoeven
Software Architect

Sony Techsoft Centre Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone:    +32 (0)2 700 8453
Fax:      +32 (0)2 700 8622
E-mail:   Geert.Uytterhoeven@sonycom.com
Internet: http://www.sony-europe.com/

A division of Sony Europe (Belgium) N.V.
VAT BE 0413.825.160 · RPR Brussels
Fortis · BIC GEBABEBB · IBAN BE41293037680010

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 14:35                                     ` Christoph Hellwig
@ 2009-03-27 15:03                                       ` Ric Wheeler
  2009-03-27 20:38                                       ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-27 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Tso, Jens Axboe, Linus Torvalds, Jeff Garzik,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Christoph Hellwig wrote:
> On Fri, Mar 27, 2009 at 10:13:33AM -0400, Theodore Tso wrote:
>   
>> On Fri, Mar 27, 2009 at 08:57:23AM +0100, Jens Axboe wrote:
>>     
>>> Here's a simple patch that does that. Not even tested, it compiles. Note
>>> that file systems that currently do blkdev_issue_flush() in their
>>> ->sync() should then get it removed.
>>>
>>>       
>> That's going to be a mess.  Ext3 implements an fsync() by requesting a
>> journal commit, and then waiting for the commit to have taken place.
>> The commit happens in another thread, kjournald.  Knowing when it's OK
>> not to do a blkdev_issue_flush() because the commit was triggered by
>> an fsync() is going to be really messy.  Could we at least have a flag
>> in struct super which says, "We'll handle the flush correctly, please
>> don't try to do it for us?"
>>     
>
> Doing it in vfs_fsync also is completely wrong layering.  If people want
> it for simple filesystems add it to file_fsync instead of messing up
> the generic helper.  Removing well meaning but ill behaved policy from
> the generic path has been costing me far too much time lately.
>
> And please add a tunable for the flush.  Preferably a generic one at
> the block device layer instead of the current mess where every
> filesystem has a slightly different option for barrier usage.
>   

I agree that we need to be careful not to add extra device flushes if 
the file system handles this properly. They can be quite expensive (say 
10-20ms on a busy SATA disk).

I have also seen some SSD devices have performance that drops into the 
toilet when you start flushing their volatile caches.

ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 14:51                                     ` Linux 2.6.29 Matthew Garrett
@ 2009-03-27 15:08                                       ` Alan Cox
  2009-03-27 15:22                                         ` Matthew Garrett
  2009-03-27 15:20                                       ` Giacomo A. Catenazzi
  1 sibling, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-27 15:08 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> And if a behaviour is in ext3, then for the vast majority of practical 
> purposes it exists everywhere. Users of non-Linux POSIX operating systems 
> are niche. Users of non-ext3 filesystems on Linux are niche.

SuSE for years shipped reiserfs as a default.

> When my grandmother was growing up she had to use an outside toilet. 
> Sometimes the past sucked and we're glad of progress being made.

Not checking for errors is not "progress"; it's indiscipline aided by
languages and tools that permit it to occur without issuing errors. It's
why software "engineering" is at best approaching early 1950's real
engineering practice ("hey gee we should test this stuff") and has yet to
grow up and get anywhere into the world of real engineering and quality.

Alan

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 14:51                                     ` Linux 2.6.29 Matthew Garrett
  2009-03-27 15:08                                       ` Alan Cox
@ 2009-03-27 15:20                                       ` Giacomo A. Catenazzi
  1 sibling, 0 replies; 664+ messages in thread
From: Giacomo A. Catenazzi @ 2009-03-27 15:20 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Matthew Garrett wrote:
> On Fri, Mar 27, 2009 at 07:24:38AM -0400, Theodore Tso wrote:
>> On Fri, Mar 27, 2009 at 06:21:14AM +0000, Matthew Garrett wrote:
>>> (It could be argued that most relevant Unices implemented fsync() even 
>>> before then, so its status in POSIX was broadly irrelevant. The obvious 
>>> counterargument is that most relevant Unix filesystems ensure that data 
>>> is written before a clobbering rename() is carried out, so POSIX is 
>>> again not especially relevant)
>> Nope, not true.  Most relevant Unix file systems sync'ed data blocks
>> on a 30 second timer, and metadata on 5 second timers.  They did *not* force
>> data to be written before a clobbering rename() was carried out;
>> you're rewriting history when you say that; it's simply not true.
>> Rename was atomic *only* where metadata was concerned, and all the
>> talk about rename being atomic was because back then we didn't have
>> flock() and you built locking primitives with open(O_CREAT) and rename();
>> but that was only metadata, and that was only if the system didn't
>> crash.
> 
> No, you're missing my point. The other Unix file systems are irrelevant. 
> The number of people running them and having any real risk of system 
> crash is small, and they're the ones with full system backups anyway.


Are you telling us that "Linux compatible" really means
"Linux compatible, but only on ext3, only on x86,
only on Ubuntu, only Gnome or KDE [1]"?
If a program crashes on other setups, is it not a
problem of the program but of the environment?

sigh
	cate


[1] Yes, I just saw an installation script that expects one of the
two environments.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 15:08                                       ` Alan Cox
@ 2009-03-27 15:22                                         ` Matthew Garrett
  2009-03-27 16:15                                           ` Alan Cox
  0 siblings, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27 15:22 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 03:08:11PM +0000, Alan Cox wrote:

> Not checking for errors is not "progress"; it's indiscipline aided by
> languages and tools that permit it to occur without issuing errors. It's
> why software "engineering" is at best approaching early 1950's real
> engineering practice ("hey gee we should test this stuff") and has yet to
> grow up and get anywhere into the world of real engineering and quality.

No. Not *having* to check for errors in the cases that you care about is 
progress. How much of the core kernel actually deals with kmalloc 
failures sensibly? Some things just aren't worth it.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 14:53   ` Geert Uytterhoeven
@ 2009-03-27 15:46     ` Mike Galbraith
  2009-03-27 16:02       ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Mike Galbraith @ 2009-03-27 15:46 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Hans-Peter Jansen, linux-kernel, Linus Torvalds

On Fri, 2009-03-27 at 15:53 +0100, Geert Uytterhoeven wrote:
> On Fri, 27 Mar 2009, Hans-Peter Jansen wrote:
> > On Tuesday, 24 March 2009, Linus Torvalds wrote:
> > > This obviously starts the merge window for 2.6.30, although as usual,
> > > I'll probably wait a day or two before I start actively merging.
> > 
> > It would be very nice, if you could start with a commit to Makefile, that 
> > reflects the new series: e.g.:
> > 
> > VERSION = 2
> > PATCHLEVEL = 6
> > SUBLEVEL = 30
> > EXTRAVERSION = -pre
> > 
> > -pre for preparing state.
> 
> If you're using the kernel-of-the-day, you're probably using git, and
> CONFIG_LOCALVERSION_AUTO=y should be mandatory.

I sure hope it never becomes mandatory, I despise that thing.  I don't
even do -rc tags.  .nn is .nn until baked and nn.1 appears.

(would be nice if baked were immediately handed to stable .nn.0 instead
of being in limbo for a bit, but I don't drive the cart, just tag along
behind [w. shovel];)

	-Mike


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 15:46     ` Mike Galbraith
@ 2009-03-27 16:02       ` Linus Torvalds
  2009-03-28  7:50         ` Mike Galbraith
  2009-03-30 22:00         ` Hans-Peter Jansen
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 16:02 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Geert Uytterhoeven, Hans-Peter Jansen, linux-kernel



On Fri, 27 Mar 2009, Mike Galbraith wrote:
> > 
> > If you're using the kernel-of-the-day, you're probably using git, and
> > CONFIG_LOCALVERSION_AUTO=y should be mandatory.
> 
> I sure hope it never becomes mandatory, I despise that thing.  I don't
> even do -rc tags.  .nn is .nn until baked and nn.1 appears.

If you're a git user that changes kernels frequently, then enabling 
CONFIG_LOCALVERSION_AUTO is _really_ convenient when you learn to use it.

This is quite common for me:

	gitk v$(uname -r)..

and it works exactly due to CONFIG_LOCALVERSION_AUTO (and because git is 
rather good at figuring out version numbers). It's a great way to say 
"ok, what is in my git tree that I'm not actually running right now".

Another case where CONFIG_LOCALVERSION_AUTO is very useful is when you're 
noticing some new broken behavior, but it took you a while to notice. 
You've rebooted several times since, but you know it worked last Tuesday. 
What do you do?

The thing to do is

	grep "Linux version" /var/log/messages*

and figure out what the good version was, and then do 

	git bisect start
	git bisect good ..that-version..
	git bisect bad v$(uname -r)

and off you go. This is _very_ convenient if you are working with some 
"random git kernel of the day" like I am (and like hopefully others are 
too, in order to get test coverage).
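
To make the log-grepping step concrete, here is a small sketch of
pulling the git-usable revision out of a "Linux version" boot line. The
function name parse_boot_line, the regular expression, and the sample
log line are made up for illustration; the only thing assumed about the
format is the <tag>-<count>-g<sha> suffix that shows up in version
strings like the 2.6.29-03321-gbe0ea69 mentioned earlier in the thread.

```python
import re

# Base version, optionally followed by "-<commits since tag>-g<short sha>"
# as appended when CONFIG_LOCALVERSION_AUTO is enabled.
_VERSION_RE = re.compile(
    r"Linux version (?P<base>\d+\.\d+\.\d+(?:-rc\d+)?)"
    r"(?P<suffix>-(?P<count>\d+)-g(?P<sha>[0-9a-f]+))?"
)


def parse_boot_line(line):
    """Return version details from a 'Linux version ...' log line, or None."""
    m = _VERSION_RE.search(line)
    if m is None:
        return None
    count = m.group("count")
    return {
        "base": m.group("base"),
        "commits_since_tag": int(count) if count else 0,
        "sha": m.group("sha"),     # None for a plain release kernel
        # The same "v$(uname -r)" spelling used in the commands above.
        "git_rev": "v" + m.group("base") + (m.group("suffix") or ""),
    }


if __name__ == "__main__":
    # Made-up sample line in the usual dmesg format.
    sample = "Mar 24 09:12:01 box kernel: Linux version 2.6.29-03321-gbe0ea69 (me@box) #1 SMP"
    print(parse_boot_line(sample))
```

The "git_rev" value is what would be handed to "git bisect good" for
the last boot that is known to have worked.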

> (would be nice if baked were immediately handed to stable .nn.0 instead
> of being in limbo for a bit, but I don't drive the cart, just tag along
> behind [w. shovel];)

Note that the "v2.6.29[-rcX]" part is totally _useless_ in many cases, 
because if you're working past merges, and especially if you end up doing 
bisection, it is very possible that the main Makefile says "2.6.28-rc2", 
but the code you're working on wasn't actually _merged_ until after 
2.6.29. 

In other words, the main Makefile version is totally useless in non-linear 
development, and is meaningful _only_ at specific release times. In 
between releases, it's essentially a random thing, since non-linear 
development means that versioning simply fundamentally isn't some simple
monotonic numbering. And this is exactly when CONFIG_LOCALVERSION_AUTO is 
a huge deal.

(It's even more so if you end up looking at "next" or merging other 
people's trees. If you only ever track my kernel, and you only ever 
fast-forward - no bisection, no nothing - then the release numbering looks 
"simple", and things like LOCALVERSION look just like noise).

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 15:22                                         ` Matthew Garrett
@ 2009-03-27 16:15                                           ` Alan Cox
  2009-03-27 16:28                                             ` Matthew Garrett
  0 siblings, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-27 16:15 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> No. Not *having* to check for errors in the cases that you care about is 
> progress. How much of the core kernel actually deals with kmalloc 
> failures sensibly? Some things just aren't worth it.

I'm glad to know that's how you feel about my data; it explains a good
deal about the state of some of the desktop software. In kernel land we
actually have tools that go looking for kmalloc errors and missing tests
to try and check all the paths. We run kernels with kmalloc randomly
failing to make sure the box stays up: because at the end of the day
*kmalloc does fail*. The kernel also tries very hard to keep the fail
rate low - but this doesn't mean you don't check for errors.

Everything in other industries says not having to check for errors is
missing the point. You design systems so that they do not have error
cases when possible, and if they have error cases you handle them and
enforce a policy that prevents them not being handled.

Standard food safety rules include 

- Labelling food with dates
- Having an electronic system so that any product with no label cannot
  escape
- Checking all labels to ensure nothing past the safe date is sold
- Having rules at all stages that any item without a label is removed and
  is flagged back so that it can be investigated

Now you are arguing for "not having to check for errors"

So I assume you wouldn't worry about food that ends up with no label on
it somehow ?

Or when you get a "permission denied" do you just assume it didn't
happen ? If the bank says someone has removed all your money do you
assume it's an error you don't need to check for ?

The two are *not* the same thing.

- You design failure out when possible
- You implement systems which ensure all known failure cases must be handled
- You track failure rates to prove your analysis
- Where you don't handle a failure (because it is too hard) you have
  detailed statistical and other analysis based on rigorous methodologies
  as to whether not handling it is acceptable (eg ALARP)

and unfortunately at big name universities you can still get a degree or
masters even in software "engineering" without actually studying any of
this stuff, which any real engineering discipline would consider basic
essentials.

How do we design failure out
- One obvious one is to report out of disk space on write not close. At
  the app level programmers need to actually check their I/O returns
  (contrary to much of today's garbage software, open and proprietary)
  or use languages which actually tell them off if each exception case
  is not caught somewhere

- Use disk and file formats that ensure across a failure you don't
  suddenly get random users' medical data popping up post reboot in
  index.html or motd. Hence ordered data writes by default (or the same
  effect)

- Writing back data regularly to allow for the fact user space
  programmers will make mistakes regardless. But this doesn't mean they
  "don't check for errors"

And if you think an error check isn't worth making then I hope you can
provide the statistical data, based on there being millions of such
systems and in the case of sloppy application writing where the result
is "oh dear where did the data go" I don't think you can at the moment.
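
As a minimal sketch of "check your I/O returns", assuming plain POSIX
open()/write()/fsync()/close(): the helper names write_all() and
save_checked() are hypothetical and are not taken from any library
mentioned in this thread. The point is only that an out-of-space or
media error is reported at the step where it occurs rather than being
dropped.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying short writes
 * and EINTR. Returns 0 on success, -1 with errno set (e.g. ENOSPC). */
int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Hypothetical helper: create or truncate path, write buf, and check
 * every step, including fsync() and close(). Returns 0 or -1. */
int save_checked(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
		int saved = errno;	/* close() may overwrite errno */

		close(fd);
		errno = saved;
		return -1;
	}
	return close(fd);	/* close() can also report an error */
}
```

A caller would test the return value and report strerror(errno) to the
user instead of assuming the save happened.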

To be honest I don't see your problem. Surely well designed desktop
applications are already all using nice error handling, out of space and
fsync aware interfaces in the gnome library that do all the work for them
- "so they don't have to check for errors".

If not perhaps the desktop should start by putting their own house in
order ?

Alan


* Re: Linux 2.6.29
  2009-03-27 16:15                                           ` Alan Cox
@ 2009-03-27 16:28                                             ` Matthew Garrett
  2009-03-27 16:51                                               ` Alan Cox
  0 siblings, 1 reply; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27 16:28 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 04:15:53PM +0000, Alan Cox wrote:

> To be honest I don't see your problem. Surely well designed desktop
> applications are already all using nice error handling, out of space and
> fsync aware interfaces in the gnome library that do all the work for them
> - "so they don't have to check for errors".

The context was situations like errors on close() not occurring unless 
you've fsync()ed first. I don't think that error case is sufficiently 
common to warrant the cost of an fsync() on every single close, 
especially since doing so would cripple any application that ever tried 
to run on ext3.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: Linux 2.6.29
  2009-03-27 13:35 ` Linux 2.6.29 Hans-Peter Jansen
  2009-03-27 14:53   ` Geert Uytterhoeven
@ 2009-03-27 16:49   ` Frans Pop
  1 sibling, 0 replies; 664+ messages in thread
From: Frans Pop @ 2009-03-27 16:49 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel

Hans-Peter Jansen wrote:
> Am Dienstag, 24. März 2009 schrieb Linus Torvalds:
>> This obviously starts the merge window for 2.6.30, although as usual,
>> I'll probably wait a day or two before I start actively merging.
> 
> It would be very nice, if you could start with a commit to Makefile,
> that reflects the new series: e.g.:

If you have a git checkout, you can easily do this yourself:

git checkout -b 2.6.30-rc master
sed -i "/^SUBLEVEL/ s/29/30/; /^EXTRAVERSION/ s/$/ -rc0/" Makefile
git add Makefile
git commit -m "Mark as -rc0"

Then to get latest git head:

git checkout master
git pull
git rebase master 2.6.30-rc

When Linus releases -rc1, the rebase will signal a conflict on that commit 
and you can just 'git rebase --skip' it.

Instead of sed you can also just edit the Makefile of course, or you can 
go the other way and create a simple script that automatically increases 
the existing sublevel by 1. I just do this manually, given that it's only 
needed once per three months or so.

Using a branch is something I do anyway as I almost always have a few 
minor patches on top of git head for various reasons.

Cheers,
FJP


* Re: Linux 2.6.29
  2009-03-27 16:28                                             ` Matthew Garrett
@ 2009-03-27 16:51                                               ` Alan Cox
  2009-03-27 17:02                                                 ` Matthew Garrett
  0 siblings, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-27 16:51 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, 27 Mar 2009 16:28:41 +0000
Matthew Garrett <mjg59@srcf.ucam.org> wrote:

> On Fri, Mar 27, 2009 at 04:15:53PM +0000, Alan Cox wrote:
> 
> > To be honest I don't see your problem. Surely well designed desktop
> > applications are already all using nice error handling, out of space and
> > fsync aware interfaces in the gnome library that do all the work for them
> > - "so they don't have to check for errors".
> 
> The context was situations like errors on close() not occurring unless 
> you've fsync()ed first. I don't think that error case is sufficiently 
> common to warrant the cost of an fsync() on every single close, 
> especially since doing so would cripple any application that ever tried 
> to run on ext3.

The "fsync if you need to see all errors on close" case has been true since
before V7 unix. It's the normal default behaviour on these systems so
anyone who assumes otherwise is just broken. There is a limit to the
extent the OS can clean up after completely broken user apps.

Besides which a properly designed desktop clearly has a single interface
of the form   

	happened = write_file_reliably(filename|NULL, buffer, len, flags)
	happened = replace_file_reliably(filename|NULL, buffer, len,
	flags (eg KEEP_BACKUP));

which internally does all the error handling, reporting to user, offering
to save elsewhere, ensuring that the user can switch app and make space
and checking for media errors. It probably also has an asynchronous
version you can bind event handlers to for completion, error, etc so that
you can override the default handling but can't fail to provide something
by default.  That would be designing failure out of the system.
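
A rough sketch of what the core of such a call could look like,
assuming POSIX mkstemp()/fsync()/rename(). Only the name
replace_file_reliably and the KEEP_BACKUP flag come from the interface
above; the flag value, the "~" backup suffix and the temp-file naming
are assumptions made for illustration. The user interaction (reporting,
offering to save elsewhere) is left out, as are preserving the original
file's permissions and syncing the directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define KEEP_BACKUP 0x1	/* assumed value; the flag name is from the mail */

/* Sketch only: replace `path` with `buf` so that either the old or the
 * new contents survive a crash. Returns 0 on success, -1 with errno. */
int replace_file_reliably(const char *path, const char *buf, size_t len,
			  int flags)
{
	char tmp[PATH_MAX], bak[PATH_MAX];
	int fd;

	if (snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path) >= (int)sizeof(tmp) ||
	    snprintf(bak, sizeof(bak), "%s~", path) >= (int)sizeof(bak)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	fd = mkstemp(tmp);	/* temp file next to the target (mode 0600) */
	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;	/* e.g. ENOSPC is reported here */
		}
		buf += n;
		len -= (size_t)n;
	}
	if (fsync(fd) < 0)	/* new data must be stable before the rename */
		goto fail;
	if (close(fd) < 0) {
		fd = -1;
		goto fail;
	}
	fd = -1;
	if (flags & KEEP_BACKUP) {
		unlink(bak);	/* ignore failure: there may be no old backup */
		if (link(path, bak) < 0 && errno != ENOENT)
			goto fail;
	}
	if (rename(tmp, path) < 0)	/* atomically switch to the new file */
		goto fail;
	return 0;
fail:
	{
		int saved = errno;

		if (fd >= 0)
			close(fd);
		unlink(tmp);
		errno = saved;
	}
	return -1;
}
```

A desktop library would wrap this with the default error reporting
described above, so that an application cannot silently ignore a -1.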

IMHO the real solution to a lot of this actually got proposed earlier in
the thread. Adding "fbarrier()" allows the expression of ordering without
blocking and provides something new apps can use to get best performance.

Old properly written apps continue to work and can be improved, and sloppy
garbage continues to mostly work.

The file system behaviour is constrained heavily by the hardware, which
at this point is constrained by the laws of physics and the limits
of materials.

Alan


* Re: Linux 2.6.29
  2009-03-27 16:51                                               ` Alan Cox
@ 2009-03-27 17:02                                                 ` Matthew Garrett
  2009-03-27 17:19                                                   ` Alan Cox
  2009-03-27 17:57                                                   ` Linus Torvalds
  0 siblings, 2 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-03-27 17:02 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 04:51:50PM +0000, Alan Cox wrote:
> On Fri, 27 Mar 2009 16:28:41 +0000
> Matthew Garrett <mjg59@srcf.ucam.org> wrote:
> > The context was situations like errors on close() not occurring unless 
> > you've fsync()ed first. I don't think that error case is sufficiently 
> > common to warrant the cost of an fsync() on every single close, 
> > especially since doing so would cripple any application that ever tried 
> > to run on ext3.
> 
> The "fsync if you need to see all errors on close" case has been true since
> before V7 unix. It's the normal default behaviour on these systems so
> anyone who assumes otherwise is just broken. There is a limit to the
> extent the OS can clean up after completely broken user apps.

If user applications should always check errors, and if errors can't be 
reliably produced unless you fsync() before close(), then the correct 
behaviour for the kernel is to always flush buffers to disk before 
returning from close(). The reason we don't is that it would be an 
unacceptable performance hit to take in return for an uncommon case - in 
exactly the same way as always calling fsync() before close() is an 
unacceptable performance hit to take in return for an uncommon case.

> IMHO the real solution to a lot of this actually got proposed earlier in
> the thread. Adding "fbarrier()" allows the expression of ordering without
> blocking and provides something new apps can use to get best performance.

If every application that does a clobbering rename has to call 
fbarrier() first, then the kernel should just guarantee to do so on the 
application's behalf. ext3, ext4 and btrfs all effectively do this, so 
we should just make it explicit that Linux filesystems are expected to 
behave this way. If people want to make their code Linux specific then 
that's their problem, not the kernel's.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: Linux 2.6.29
  2009-03-27 17:02                                                 ` Matthew Garrett
@ 2009-03-27 17:19                                                   ` Alan Cox
  2009-03-27 18:05                                                     ` Linus Torvalds
  2009-03-27 18:36                                                     ` Hua Zhong
  2009-03-27 17:57                                                   ` Linus Torvalds
  1 sibling, 2 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-27 17:19 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Theodore Tso, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> If user applications should always check errors, and if errors can't be 
> reliably produced unless you fsync() before close(), then the correct 
> behaviour for the kernel is to always flush buffers to disk before 
> returning from close(). The reason we don't is that it would be an 

You make a few assumptions here

Unfortunately:
- close() occurs many times on a file
- the kernel cannot tell which close() calls need to commit data
- there are many cases where data is written and there is a genuine
  situation where it is acceptable over a crash to lose data providing
  media failure is rare (eg log files in many situations - not banks
  obviously)

The kernel cannot tell them apart, while fsync/close() as a pair allows
the user to correctly indicate their requirements.

Even "fsync on last close" can backfire horribly if you happen to have a
handle that is inherited by a child task or kept for reading for a long
period.

For an event driven app you really want some kind of threaded or async
fsync then close (fbarrier isn't quite enough because you don't get told
when the barrier is passed). That could be implemented using threads in
the relevant desktops libraries with the thread doing

	fsync()
	poke event thread
	exit

(or indeed for most cases as part of the more general
write-file-interact-with-user-etc call)
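
The three-step thread above might be sketched as follows, assuming
POSIX threads and a pipe whose read end is watched by the event loop.
The names struct fsync_req and fsync_async() are hypothetical; this is
not the API of any desktop library mentioned in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical request: the fd to sync and the write end of a pipe
 * that the event thread polls for completions. */
struct fsync_req {
	int fd;
	int notify_fd;
};

/* Worker thread: fsync(), poke the event thread, exit. */
static void *fsync_worker(void *arg)
{
	struct fsync_req *req = arg;
	int result = (fsync(req->fd) < 0) ? errno : 0;

	/* The int written is 0 on success or the errno from fsync(). */
	while (write(req->notify_fd, &result, sizeof(result)) < 0 &&
	       errno == EINTR)
		;
	free(req);
	return NULL;
}

/* Start a detached worker. Returns 0 or an errno-style error code.
 * The caller must keep fd open until the notification arrives. */
int fsync_async(int fd, int notify_fd)
{
	pthread_t tid;
	struct fsync_req *req = malloc(sizeof(*req));
	int err;

	if (!req)
		return ENOMEM;
	req->fd = fd;
	req->notify_fd = notify_fd;
	err = pthread_create(&tid, NULL, fsync_worker, req);
	if (err) {
		free(req);
		return err;
	}
	pthread_detach(tid);
	return 0;
}
```

The event thread reads one int per request from the pipe and only then
does the close() and any "saved" indication to the user.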

> If every application that does a clobbering rename has to call 
> fbarrier() first, then the kernel should just guarantee to do so on the 

Rename is a different problem - and a nastier one. Unfortunately even in
posix fsync says nothing about how metadata updating is handled or what
the ordering rules are between two fsync() calls on different files.

There were problems with trying to order rename against data writeback.
fsync ensures the file data and metadata is valid but doesn't (and
cannot) connect this with the directory state. So if you need to implement

	write data
	ensure it is committed
	rename it
	after the rename is committed then ...

you can't do that in POSIX. Linux extends fsync() so you can fsync a
directory handle but that is an extension to fix the problem rather than
a standard behaviour.
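
A minimal sketch of that sequence using the Linux directory-fsync
extension, assuming the caller already has the new file open and knows
which directory holds it. The helper names fsync_dir() and
commit_rename() are hypothetical, and whether fsync() on a directory
handle is honoured depends on the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: fsync() a directory so that name changes made
 * inside it are committed. Relies on the Linux extension, not POSIX. */
int fsync_dir(const char *dirpath)
{
	int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
	int ret, saved;

	if (fd < 0)
		return -1;
	ret = fsync(fd);
	saved = errno;		/* close() may overwrite errno */
	close(fd);
	errno = saved;
	return ret;
}

/* Hypothetical helper: fd is the open, fully written file at tmppath;
 * dirpath is the directory containing both names. Returns 0 or -1. */
int commit_rename(int fd, const char *tmppath, const char *finalpath,
		  const char *dirpath)
{
	if (fsync(fd) < 0)			/* ensure the data is committed */
		return -1;
	if (rename(tmppath, finalpath) < 0)	/* rename it */
		return -1;
	return fsync_dir(dirpath);		/* commit the rename itself */
}
```

Only after commit_rename() returns 0 would the caller go on to the
"after the rename is committed then ..." step.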

(Also helpful here would be fsync_range, fdatasync_range and
fbarrier_range)

> application's behalf. ext3, ext4 and btrfs all effectively do this, so 
> we should just make it explicit that Linux filesystems are expected to 
> behave this way. 

> If people want to make their code Linux specific then  that's their problem, not the kernel's.

Agreed - which is why close should not happen to do an fsync(). That's
their problem for writing code that's specific to some random may happen
behaviour on certain Linux releases - and unfortunately with no obvious
cheap cure.

--
	"Alan, I'm getting a bit worried about you."
				-- Linus Torvalds


* Re: Linux 2.6.29
  2009-03-27 17:02                                                 ` Matthew Garrett
  2009-03-27 17:19                                                   ` Alan Cox
@ 2009-03-27 17:57                                                   ` Linus Torvalds
  2009-03-27 18:22                                                     ` Linus Torvalds
                                                                       ` (2 more replies)
  1 sibling, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 17:57 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Alan Cox, Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On Fri, 27 Mar 2009, Matthew Garrett wrote:
> 
> If every application that does a clobbering rename has to call 
> fbarrier() first, then the kernel should just guarantee to do so on the 
> application's behalf. ext3, ext4 and btrfs all effectively do this, so 
> we should just make it explicit that Linux filesystems are expected to 
> behave this way. If people want to make their code Linux specific then 
> that's their problem, not the kernel's.

It would probably be good to think about something like this, because 
there are currently really two totally different cases of "fsync()" users.

 (a) The "critical safety" kind (aka the "traditional" fsync user), where 
     there is a mail server or similar that will reply "all done" to the 
     sender, and has to _guarantee_ that the file is on disk in order for 
     data to simply not be lost.

     This is a very different case from most desktop uses, and it's a very 
     hard "we have to wait until the thing is physically on disk" 
     situation. And it's the only case where people really traditionally 
     used "fsync()".

 (b) The non-traditional UNIX usage which people historically didn't use
     fsync() for: people editing their config files either 
     programmatically or by hand.

     And this one really doesn't need at all the same kind of hard "wait 
     for it to hit the disk" semantics. It may well want a much softer 
     kind of "at least don't delete the old version until the new version 
     is stable" kind of thing.

And Alan - you can argue that fsync() has been around forever, but you 
cannot possibly argue that people have used fsync() for file editing. 
That's simply not true. It has happened, but it has been very rare. Yes, 
some editors (vi, emacs) do it, but even there it's configurable. And 
outside of databases, server apps and big editors, fsync is virtually 
unheard of. How many sed-scripts have you seen to edit files? None of them 
ever used fsync.

And with the ext3 performance profile for it, it sure is not getting any 
more common either. If you have a desktop app that uses fsync(), that 
application is DEAD IN THE WATER if people are doing anything else on the 
machine. Those multi-second pauses aren't going to make people happy.

So the fact is, "people should always use fsync" simply isn't a realistic 
expectation, nor is it historically accurate. Claiming it is is just 
obviously bogus. And claiming that people _should_ do it is crazy, since 
it performs badly enough to simply not be realistic.

Alternatives should be looked at. For desktop apps, the best alternatives 
are likely simply stronger default consistency guarantees. Exactly the 
"we don't guarantee that your data hits the disk, but we do guarantee that 
if you renamed on top of another file, you'll not have lost _both_ 
contents".

			Linus


* Re: Linux 2.6.29
  2009-03-27 17:19                                                   ` Alan Cox
@ 2009-03-27 18:05                                                     ` Linus Torvalds
  2009-03-27 18:35                                                       ` Alan Cox
  2009-03-27 19:03                                                       ` Theodore Tso
  2009-03-27 18:36                                                     ` Hua Zhong
  1 sibling, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 18:05 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matthew Garrett, Theodore Tso, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Fri, 27 Mar 2009, Alan Cox wrote:
> 
> The kernel cannot tell them apart, while fsync/close() as a pair allows
> the user to correctly indicate their requirements.

Alan. Repeat after me: "fsync()+close() is basically useless for any app 
that expects user interaction under load".

That's a FACT, not an opinion.

> For an event driven app you really want some kind of threaded or async
> fsync then close

Don't be silly. If you want data corruption, then you make people write 
threaded applications. Yes, you may work for Intel now, but that doesn't 
mean that you have to drink the insane Kool-Aid. Threading is HARD. Async 
stuff is HARD. 

We kernel people really are special. Expecting normal apps to spend the 
kind of effort we do (in scalability, in error handling, in security) is 
just not realistic. 

> Agreed - which is why close should not happen to do an fsync(). That's
> their problem for writing code that's specific to some random may happen
> behaviour on certain Linux releases - and unfortunately with no obvious
> cheap cure.

I do agree that close() shouldn't do an fsync - simply for performance 
reasons.

But I also think that the "we write meta-data synchronously, but then the 
actual data shows up at some random later time" is just crazy talk. That's 
simply insane. It _guarantees_ that there will be huge windows of times 
where data simply will be lost if something bad happens.

And expecting every app to do fsync() is also crazy talk, especially with 
the major filesystems _sucking_ so bad at it (it's actually a lot more 
realistic with ext2 than it is with ext3).

So look for a middle ground. Not this crazy militant "user apps must do 
fsync()" crap. Because that is simply not a realistic scenario.

			Linus


* Re: Linux 2.6.29
  2009-03-27 17:57                                                   ` Linus Torvalds
@ 2009-03-27 18:22                                                     ` Linus Torvalds
  2009-03-27 18:32                                                     ` Alan Cox
  2009-03-27 19:43                                                     ` Jeff Garzik
  2 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 18:22 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Alan Cox, Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On Fri, 27 Mar 2009, Linus Torvalds wrote:
> 
> Yes, some editors (vi, emacs) do it, but even there it's configurable. 

.. and looking at history, it's even pretty modern. From the vim logs:

	Patch 6.2.499
	Problem:   When writing a file and halting the system, the file might be lost
	           when using a journalling file system.
	Solution:  Use fsync() to flush the file data to disk after writing a file.
	           (Radim Kolar)
	Files:     src/fileio.c

so it looks (assuming those patch numbers mean what they would seem to 
mean) that 'fsync()' in vim is from after 6.2 was released. Some time in 
2004.

So traditionally, even solid "good" programs like major editors never 
tried to fsync() their files.

Btw, googling for that 6.2.499 patch also shows that people were rather 
unhappy with it. Why? It causes disk spinups in laptop mode etc. Which is 
very much not what you want to see for power reasons.

So there are other, really fundamental, reasons why applications that 
don't have the "mailspool must not be lost" kind of critical issues 
should absolutely NOT use fsync(). Those applications would be much better 
off with some softer hint that can take things like laptop mode into account.

		Linus


* Re: Linux 2.6.29
  2009-03-27 17:57                                                   ` Linus Torvalds
  2009-03-27 18:22                                                     ` Linus Torvalds
@ 2009-03-27 18:32                                                     ` Alan Cox
  2009-03-27 18:40                                                       ` Linus Torvalds
  2009-03-27 19:43                                                     ` Jeff Garzik
  2 siblings, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-27 18:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Garrett, Theodore Tso, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> more common either. If you have a desktop app that uses fsync(), that 
> application is DEAD IN THE WATER if people are doing anything else on the 
> machine. Those multi-second pauses aren't going to make people happy.

We added threading about ten years ago.

> So the fact is, "people should always use fsync" simply isn't a realistic 
> expectation, nor is it historically accurate. 

Far too many people don't - and it is unfortunate but people should learn
to write quality software.
> 
> Alternatives should be looked at. For desktop apps, the best alternatives 
> are likely simply stronger default consistency guarantees. Exactly the 
> "we don't guarantee that your data hits the disk, but we do guarantee that 
> if you renamed on top of another file, you'll not have lost _both_ 
> contents".

Rename is a really nasty case and the standards don't help at all here so
I agree entirely. There *isn't* a way to write a correct portable
application that achieves that guarantee without the kernel making it for
you.


* Re: Linux 2.6.29
  2009-03-27 18:05                                                     ` Linus Torvalds
@ 2009-03-27 18:35                                                       ` Alan Cox
  2009-03-27 19:03                                                       ` Theodore Tso
  1 sibling, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-27 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Garrett, Theodore Tso, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> Don't be silly. If you want data corruption, then you make people write 
> threaded applications. Yes, you may work for Intel now, but that doesn't 
> mean that you have to drink the insane Kool-Aid. Threading is HARD. Async 
> stuff is HARD. 

Which is why you do it once in a library and express it as events. The
gtk desktop already does this and the event model it provides is rather
elegant and can handle this neatly and cleanly for the user.

> But I also think that the "we write meta-data synchronously, but then the 
> actual data shows up at some random later time" is just crazy talk. That's 
> simply insane. It _guarantees_ that there will be huge windows of times 
> where data simply will be lost if something bad happens.

Agreed - apps not checking for errors is sloppy programming; however, given
they make errors, we don't want to make it worse. I wouldn't argue with
that - for the same reason that cars are designed on the basis that their
owners are not competent to operate them ;)



* RE: Linux 2.6.29
  2009-03-27 17:19                                                   ` Alan Cox
  2009-03-27 18:05                                                     ` Linus Torvalds
@ 2009-03-27 18:36                                                     ` Hua Zhong
  1 sibling, 0 replies; 664+ messages in thread
From: Hua Zhong @ 2009-03-27 18:36 UTC (permalink / raw)
  To: 'Alan Cox', 'Matthew Garrett'
  Cc: 'Theodore Tso', 'Linus Torvalds',
	'Andrew Morton', 'David Rees',
	'Jesper Krogh', 'Linux Kernel Mailing List'

Why are we even arguing about standards?

POSIX, as all other standards, is a common _denominator_ and absolutely the
_minimal_ requirement for a compliant operating system. It does not tell you
how to design the best systems in the real world. For God's sake, can't we
aim for something higher than a piece of literature written some 20 years
ago? And stop making excuses please?

The fact is, most software is crap, and most software developers are lazy
and stupid. Same as most customers are stupid too. A technically correct
operating system isn't necessarily the most successful and accepted
operating system. Have a sense of pragmatism if you are developing something
that is not just a fancy research project.

And it's especially true for ext4. I bet nobody would care about what it did
if it called itself bloody-fast-next-gen-fs, and of course probably nobody
would use it either. But since it's putting the "ext" and "next default
Linux filesystem in all distros" hat on, it'd better take both the glory and
the crap with it. So, no matter whether ext3 made some mistakes, you can't
just throw it all away while keeping its name to give people the false sense
of comfort.

I am really glad that Theodore changed ext4 to handle the common practice of
truncate/rename sequences. It's absolutely necessary. It's not a "favor for
stupid user space", but a mandatory requirement if you even remotely want it
to be a general-purpose file system. In the end, it doesn't matter how
standards-compliant you are - people will only choose the filesystem that is
the most reliable, fastest, and works with the largest number of applications.

Hua




* Re: Linux 2.6.29
  2009-03-27 18:32                                                     ` Alan Cox
@ 2009-03-27 18:40                                                       ` Linus Torvalds
  2009-03-27 19:00                                                         ` Alan Cox
  2009-03-27 20:27                                                         ` Felipe Contreras
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 18:40 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matthew Garrett, Theodore Tso, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List



On Fri, 27 Mar 2009, Alan Cox wrote:
> 
> > So the fact is, "people should always use fsync" simply isn't a realistic 
> > expectation, nor is it historically accurate. 
> 
> Far too many people don't - and it is unfortunate but people should learn
> to write quality software.

You're ignoring reality.

Your definition of "quality software" is PURE SH*T.

Look at that laptop disk spinup issue. Look at the performance issue. Look 
at something as nebulous as "usability".

If adding fsync's makes software unusable (and it does), then you 
shouldn't call that "quality software".

Alan, just please face that reality, and think about it for a moment. If 
fsync() was instantaneous, this discussion wouldn't exist. But read the 
thread. We're talking 3-5s under NORMAL load, with peaks of minutes.

		Linus


* Re: Linux 2.6.29
  2009-03-27 18:40                                                       ` Linus Torvalds
@ 2009-03-27 19:00                                                         ` Alan Cox
  2009-03-29  9:15                                                           ` Xavier Bestel
  2009-03-27 20:27                                                         ` Felipe Contreras
  1 sibling, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-27 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Garrett, Theodore Tso, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> > Far too many people don't - and it is unfortunate but people should learn
> > to write quality software.
> 
> You're ignoring reality.
> 
> Your definition of "quality software" is PURE SH*T.

Actually "pure sh*t" is most of the software currently written. The more
code I read the happier I get that the lawmakers are finally sick of it
and going to make damned sure software is subject to liability law. Boy
will that improve things.

> Alan, just please face that reality, and think about it for a moment. If 
> fsync() was instantaneous, this discussion wouldn't exist. But read the 
> thread. We're talking 3-5s under NORMAL load, with peaks of minutes.

The peaks of minutes is a bug. The 3-5 seconds is the thread discussion.

Alan


* Re: Linux 2.6.29
  2009-03-27 18:05                                                     ` Linus Torvalds
  2009-03-27 18:35                                                       ` Alan Cox
@ 2009-03-27 19:03                                                       ` Theodore Tso
  2009-03-27 19:14                                                         ` Alan Cox
  2009-03-27 19:19                                                         ` Gene Heskett
  1 sibling, 2 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-27 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Matthew Garrett, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 11:05:58AM -0700, Linus Torvalds wrote:
> 
> Alan. Repeat after me: "fsync()+close() is basically useless for any app 
> that expects user interaction under load".
> 
> That's a FACT, not an opinion.

This is a fact for ext3 with data=ordered mode.  Which is the default
and dominant filesystem today, yes.  But it's not true for most other
filesystems.  Hopefully at some point we will migrate people off of
ext3 to something better.  Ext4 is available today, and is much better
at this than ext3.  In the long run, btrfs will be better yet.  The
issue then is how do we transition people away from making assumptions
that were essentially only true for ext3's data=ordered mode.  Ext4,
btrfs, XFS, all will have the property that if you fsync() a small
file, it will be fast, and it won't inflict major delays for other
programs running on the same system.

You've said for a long time that ext3 is really bad in that it
inflicts this --- I agree with you.  People should use other
filesystems which are better.  This includes ext4, which is completely
format compatible with ext3.  They don't even have to switch on
extents support to get better behaviour.  Just mounting an ext3
filesystem with ext4 will result in better behaviour.

So maybe we can't tell application writers, *today*, that they should
use fsync().  But in the future, we should be able to tell them that.
Or maybe we can tell them that if they want, they can use some new
interface, such as a proposed fbarrier() that will do the right thing
(including perhaps being a no-op on ext3) no matter what the
filesystem might be.

I do believe that the last thing we should do is tell people that
because of the characteristics of ext3, which you yourself have said
sucks, and which we've largely fixed for ext4, and which isn't a
problem with other filesystems, including some that may likely replace
ext3 *and* ext4, that we should give people advice that will lock
applications into doing some very bad things for the indefinite
future.

And I'm not blaming userspace; this is at least as much, if not
entirely, ext3's fault.  What that means is we need to work on a way
of providing a transition path back to a better place for the overall
system, which includes both the kernel and userspace application
libraries, such as those found in GNOME, KDE, et al.

> So look for a middle ground. Not this crazy militant "user apps must do 
> fsync()" crap. Because that is simply not a realistic scenario.

Agreed, we need a middle ground.  We need a transition path that
recognizes that ext3 won't be the dominant filesystem for Linux in
perpetuity, and that ext3's data=ordered semantics will someday no
longer be a major factor in application design.  fbarrier() semantics
might be one approach; there may be others.  It's something we need to
figure out.

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:03                                                       ` Theodore Tso
@ 2009-03-27 19:14                                                         ` Alan Cox
  2009-03-27 19:32                                                           ` Theodore Tso
  2009-03-27 19:19                                                         ` Gene Heskett
  1 sibling, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-27 19:14 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Matthew Garrett, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> Agreed, we need a middle ground.  We need a transition path that
> recognizes that ext3 won't be the dominant filesystem for Linux in
> perpetuity, and that ext3's data=ordered semantics will someday no
> longer be a major factor in application design.  fbarrier() semantics
> might be one approach; there may be others.  It's something we need to
> figure out.

Would making close imply fbarrier() rather than fsync() work for this ?
That would give people the ordering they want even if they are less
careful but wouldn't give the media error cases - which are less
interesting.

Alan

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  7:57                                 ` Jens Axboe
  2009-03-27 14:13                                   ` Theodore Tso
@ 2009-03-27 19:14                                   ` Chris Mason
  1 sibling, 0 replies; 664+ messages in thread
From: Chris Mason @ 2009-03-27 19:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Jeff Garzik, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Fri, 2009-03-27 at 08:57 +0100, Jens Axboe wrote:
> On Wed, Mar 25 2009, Linus Torvalds wrote:
> > 
> > 
> > On Wed, 25 Mar 2009, Jeff Garzik wrote:
> > > 
> > > It is clearly possible to implement an fsync(2) that causes FLUSH CACHE to be
> > > issued, without adding full barrier support to a filesystem.  It is likely
> > > doable to avoid touching per-filesystem code at all, if we issue the flush
> > > from a generic fsync(2) code path in the kernel.
> > 
> > We could easily do that. It would even work for most cases. The 
> > problematic ones are where filesystems do their own disk management, but I 
> > guess those people can do their own fsync() management too.
> > 
> > Somebody send me the patch, we can try it out.
> 
> Here's a simple patch that does that. Not even tested, it compiles. Note
> that file systems that currently do blkdev_issue_flush() in their
> ->sync() should then get it removed.
> 

The filesystems vary a bit, but in general the perfect fsync (in a mail
server workload) works something like this:

step1: write out and wait for any dirty data
step2: join the running transaction

step3: hang around a bit and wait for friends and neighbors

step4: commit the transaction

step4a: write the log blocks
step4b: barrier.  This barrier also makes sure the data is on disk

step4c: write the commit block

step4d: barrier.  This barrier makes sure the commit block is on disk.

For ext3/4 and reiserfs, steps 4b,c,d are actually one call to submit_bh
where two cache flushes are done for us, but they really are two cache
flushes.

During step 3, we collect a bunch of other procs who are hopefully also
running fsync.  If we collect 50 procs, then the single barrier in step
4b does a cache flush on the data writes of all 50.  The 50 flushes this
patch does would be one flush if the FS did it right.
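
To put rough numbers on the paragraph above, here is a minimal sketch
of the flush arithmetic.  It is only a counting model, not filesystem
code: the function names and the "batch" parameter (how many fsync
callers join one transaction during step 3) are made up for
illustration.

```c
/* Counting model only; names and parameters are hypothetical. */

/* Generic-path flush: every fsync() caller issues its own cache flush. */
unsigned int data_flushes_generic(unsigned int callers)
{
	return callers;
}

/*
 * Group commit: callers that joined the same transaction in step 3
 * share the single barrier in step 4b, so the data flush count is the
 * number of transactions, i.e. ceil(callers / batch).
 */
unsigned int data_flushes_group_commit(unsigned int callers,
				       unsigned int batch)
{
	if (callers == 0 || batch == 0)
		return 0;
	return (callers + batch - 1) / batch;
}
```

With 50 callers collected into one transaction, the generic path counts
50 data flushes where the grouped path counts one.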

In a multi-process fsync heavy workload, every extra barrier is going to
have work to do because someone is always sending data down.

The flushes done by this patch also aren't helpful for the journaled
filesystem.  If we remove the barriers from step 4b or 4d, we no longer
have a consistent FS on power failure.  Log checksumming may allow us to
get rid of the barrier in step 4b, but then we wouldn't know the data
blocks were on disk before the transaction commit, and we've had a few
discussions on that already over the last two weeks.

The patch also assumes the FS has one bdev, which isn't true for btrfs.

xfs and btrfs at least want more control over that
filemap_fdatawrite/wait step because we have to repeat it inside the FS
anyway to make sure the inode is properly updated before the commit.  I'd
much rather see a dumb fsync helper that looks like Jens' vfs_fsync, and
then let the filesystems make their own replacement for the helper in a
new address space operation or super operation.

That way we could also run the fsync on directories without the
directory mutex held, which is much faster.

Also, the patch is sending the return value from blkdev_issue_flush out
through vfs_fsync, which means I think it'll send -EOPNOTSUPP out to
userland.

So, I should be able to run any benchmark that does an fsync with this
patch and find large regressions.  It turns out it isn't quite that
easy.

First, I found that ext4 has a neat feature where it is already doing an
extra barrier on every fsync.

Even with that removed, the flushes made ext4 faster (doh!).  Looking at
the traces, ext4 and btrfs (which is totally unchanged by this patch)
both do a good job of turning my simple fsync hammering programs into
mostly sequential IO.

The extra flushes are just writing mostly sequential IO, and so they
aren't really hurting overall tput.  Plus, Ric reminded me the drive may
have some pass through for larger sequential writes, and ext4 and btrfs
may be doing enough to trigger that.

I should be able to run more complex benchmarks and get really bad
numbers out of this patch with ext4, but instead I'll try ext3...
  
This is a simple run with fs_mark using 64 threads to create 20k files
with fsync.  I timed how long it took to create 900 files.  Lower
numbers are better.

      unpatched    patched
ext3    236s         286s

-chris



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:03                                                       ` Theodore Tso
  2009-03-27 19:14                                                         ` Alan Cox
@ 2009-03-27 19:19                                                         ` Gene Heskett
  2009-03-27 19:48                                                           ` Theodore Tso
  1 sibling, 1 reply; 664+ messages in thread
From: Gene Heskett @ 2009-03-27 19:19 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Alan Cox, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Friday 27 March 2009, Theodore Tso wrote:
>On Fri, Mar 27, 2009 at 11:05:58AM -0700, Linus Torvalds wrote:
>> Alan. Repeat after me: "fsync()+close() is basically useless for any app
>> that expects user interaction under load".
>>
>> That's a FACT, not an opinion.
>
>This is a fact for ext3 with data=ordered mode.  Which is the default
>and dominant filesystem today, yes.  But it's not true for most other
>filesystems.  Hopefully at some point we will migrate people off of
>ext3 to something better.  Ext4 is available today, and is much better
>at this than ext3.  In the long run, btrfs will be better yet.  The
>issue then is how do we transition people away from making assumptions
>that were essentially only true for ext3's data=ordered mode.  Ext4,
>btrfs, XFS, all will have the property that if you fsync() a small
>file, it will be fast, and it won't inflict major delays for other
>programs running on the same system.
>
>You've said for a long time that ext3 is really bad in that it
>inflicts this --- I agree with you.  People should use other
>filesystems which are better.  This includes ext4, which is completely
>format compatible with ext3.  They don't even have to switch on
>extents support to get better behaviour.  Just mounting an ext3
>filesystem with ext4 will result in better behaviour.

Ohkay.  But in a 'make xconfig' of 2.6.28.9, how much of ext4 can be turned on 
without rendering the old ext3 fstab defaults incompatible should I be forced 
to boot a kernel with no ext4 support?

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Never look a gift horse in the mouth.
		-- Saint Jerome


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:14                                                         ` Alan Cox
@ 2009-03-27 19:32                                                           ` Theodore Tso
  2009-03-27 20:11                                                             ` Andreas T.Auer
  2009-03-31  9:58                                                             ` Neil Brown
  0 siblings, 2 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-27 19:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Matthew Garrett, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 07:14:26PM +0000, Alan Cox wrote:
> > Agreed, we need a middle ground.  We need a transition path that
> > recognizes that ext3 won't be the dominant filesystem for Linux in
> > perpetuity, and that ext3's data=ordered semantics will someday no
> > longer be a major factor in application design.  fbarrier() semantics
> > might be one approach; there may be others.  It's something we need to
> > figure out.
> 
> Would making close imply fbarrier() rather than fsync() work for this ?
> That would give people the ordering they want even if they are less
> careful but wouldn't give the media error cases - which are less
> interesting.

The thought that I had was to create a new system call, fbarrier()
which has the semantics that it will request the filesystem to make
sure that (at least) changes that have been made to data blocks to date
should be forced out to disk when the next metadata operation is
committed.  For ext3 in data=ordered mode, this would be a no-op.  For
other filesystems that had fast/efficient fsync()'s, it could simply
be an fsync().  For other filesystems, it could trigger an
asynchronous writeout, if the journal commit will wait for the
writeout to complete.  For yet other filesystems, it might set a flag
that will cause the filesystem to start a synchronous writeout of the
file as part of the commit operations.  The bottom line was that what
we could *then* tell application programmers to do is
open/write/fbarrier/close/rename.  (And for operating systems where
they don't have fbarrier, they can use autoconf magic to replace
fbarrier with fsync.)
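
As a rough sketch of that sequence, assuming nothing beyond POSIX:
fbarrier() below is only the proposal from this mail, no such system
call exists, so the macro falls back to fsync() the way the autoconf
substitution would.  The function name replace_file() and the
HAVE_FBARRIER symbol are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* fbarrier() is hypothetical; without it, fall back to fsync(). */
#ifndef HAVE_FBARRIER
#define fbarrier(fd) fsync(fd)
#endif

/*
 * open/write/fbarrier/close/rename: write the new contents to tmp_path,
 * ask for them to be ordered before the rename, then rename over path.
 * Returns 0 on success, -1 on error (tmp_path is removed on error).
 */
int replace_file(const char *path, const char *tmp_path,
		 const void *buf, size_t len)
{
	const char *p = buf;
	size_t left = len;
	int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	while (left > 0) {
		ssize_t n = write(fd, p, left);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += n;
		left -= (size_t)n;
	}

	if (fbarrier(fd) != 0)
		goto fail;
	if (close(fd) != 0) {
		unlink(tmp_path);
		return -1;
	}
	if (rename(tmp_path, path) != 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp_path);
	return -1;
}
```

A caller would use something like replace_file("config",
"config.tmp", data, len); after a crash the name "config" should then
refer to either the old contents or the complete new contents.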

We could potentially make close() imply fbarrier(), but there are
plenty of times when that might not be such a great idea.  If we do
that, we're back to requiring synchronous data writes for all files on
close(), which might lead to huge latencies, just as ext3's
data=ordered mode did.  And in many cases, where the files in
question can be easily regenerated (such as object files in a kernel
tree build), there really is no reason why it's a good idea to force
the blocks to disk on close().  In the highly unusual case where we
crash in the middle of a kernel build, we can do a "make clean; make"
and regenerate the object files.

The fundamental idea here is not all files need to be forced to disk
on close.  Not all files need fsync(), or even fbarrier().  We can
make the system go much more quickly if we can make a distinction
between these two cases.  It can also make SSD drives last longer if
we don't force blocks to disk for non-precious files.  If people
disagree with this premise, we can go back to something very much like
ext3's data=ordered mode; but then we get *all* of the problems of
ext3's data=ordered mode, including the unexpected filesystem
latencies that Linus and Ingo have been complaining about so much.
The two are very much related.

Anyway, this is just one idea; I'm not claiming that fbarrier() is the
perfect solution --- but it is one I plan to propose at the upcoming
Linux Storage and Filesystem workshop in San Francisco in a week or
so.   Maybe someone else will have a better idea.  

					- Ted



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 17:57                                                   ` Linus Torvalds
  2009-03-27 18:22                                                     ` Linus Torvalds
  2009-03-27 18:32                                                     ` Alan Cox
@ 2009-03-27 19:43                                                     ` Jeff Garzik
  2009-03-27 20:01                                                       ` Theodore Tso
  2009-03-27 21:46                                                       ` Linus Torvalds
  2 siblings, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-27 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> So the fact is, "people should always use fsync" simply isn't a realistic 
> expectation, nor is it historically accurate. Claiming it is is just 
> obviously bogus. And claiming that people _should_ do it is crazy, since 
> it performs badly enough to simply not be realistic.
> 
> Alternatives should be looked at. For desktop apps, the best alternatives 
> are likely simply stronger default consistency guarantees. Exactly the 
> "we don't guarantee that your data hits the disk, but we do guarantee that 
> if you renamed on top of another file, you'll not have lost _both_ 
> contents".

On the other side of the coin, major desktop apps Firefox and 
Thunderbird already use it:  Firefox uses sqlite to log open web pages 
in case of a crash, and sqlite in turn sync's its journal as any good 
database app should.  [I think tytso just got them to use fdatasync and 
a couple other improvements, to make this not-quite-so-bad]

Thunderbird hits the disk for each email received -- always wonderful 
with those 1000-email git-commit-head downloads... :)

So, arguments about "people should..." aside, existing desktops apps 
_do_ fsync and we get to deal with the bad performance :/

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:19                                                         ` Gene Heskett
@ 2009-03-27 19:48                                                           ` Theodore Tso
  2009-03-27 20:02                                                             ` Aaron Cohen
                                                                               ` (2 more replies)
  0 siblings, 3 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-27 19:48 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Linus Torvalds, Alan Cox, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 03:19:10PM -0400, Gene Heskett wrote:
> >You've said for a long time that ext3 is really bad in that it
> >inflicts this --- I agree with you.  People should use other
> >filesystems which are better.  This includes ext4, which is completely
> >format compatible with ext3.  They don't even have to switch on
> >extents support to get better behaviour.  Just mounting an ext3
> >filesystem with ext4 will result in better behaviour.
> 
> Ohkay.  But in a 'make xconfig' of 2.6.28.9, how much of ext4 can be turned on 
> without rendering the old ext3 fstab defaults incompatible should I be forced 
> to boot a kernel with no ext4 support?

Ext4 doesn't make any non-backwards compatible changes to the
filesystem.  So if you just take an ext3 filesystem, and mount it as
ext4, it will work just fine; you will get delayed allocation, you
will get a slightly boosted write priority for kjournald, and then
when you unmount it, that filesystem will work *just* *fine* on a
kernel with no ext4 support.  You can mount it as an ext3 filesystem.

If you use tune2fs to enable various ext4 features, such as extents,
etc., then when you mount the filesystem as ext4, you will get the
benefit of extents for any new files which are created, and once you
do that, the filesystem can't be mounted on an ext3-only system, since
ext3 doesn't know how to deal with extents.

And of course, if you want *all* of ext4's benefits, including the
full factor of 6-8 improvement in fsck times, then you will be best
served by creating a new ext4 filesystem from scratch and doing a
backup/reformat/restore pass.

But if you're just annoyed by the large latencies in Ingo's "make
-j32" example, simply taking the ext3 filesystem and mounting it as
ext4 should make those problems go away.  And it won't make any
incompatible changes to the filesystem.  (This didn't use to be true
in the pre-2.6.26 days, but I insisted on getting this fixed so people
could always mount an ext2 or ext3 filesystem using ext4 without the
kernel making any irreversible filesystem format changes behind the
user's back.)

					- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:43                                                     ` Jeff Garzik
@ 2009-03-27 20:01                                                       ` Theodore Tso
  2009-03-27 22:20                                                         ` Jeff Garzik
  2009-03-27 21:46                                                       ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-27 20:01 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 03:43:03PM -0400, Jeff Garzik wrote:
> On the other side of the coin, major desktop apps Firefox and  
> Thunderbird already use it:  Firefox uses sqlite to log open web pages  
> in case of a crash, and sqlite in turn sync's its journal as any good  
> database app should.  [I think tytso just got them to use fdatasync and  
> a couple other improvements, to make this not-quite-so-bad]

I had a very productive hour-long conversation with the Sqlite
maintainer last weekend.  He's already checked in a change to use
fdatasync() everywhere, and he's looking into other changes that would
help avoid needing to do a metadata sync because i_size has changed.
One thing that will definitely help is if applications send the
sqlite-specific SQL command "PRAGMA journal_mode = PERSIST;" when they
first start up the Sqlite database connection.  This will cause Sqlite
to keep the rollback journal file around instead of deleting and then
recreating it for each Sqlite transaction.  This avoids
at least one fsync() of the directory containing the rollback journal
file.  Combined with the change in Sqlite's development branch to use
fdatasync() everywhere that fsync() is used, this should definitely be
a huge improvement.
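
To illustrate the i_size point, here is a minimal sketch in plain
POSIX C.  It is not Sqlite's code; both function names are invented
for the example.  The idea is that if a journal file is sized and
allocated once up front, later records can be overwritten in place, so
fdatasync() has no size change to write out along with the data.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write zeros up to `size` once and fsync(), so the file size and the
 * block allocation are already on disk before any record is written.
 */
int preallocate_journal(int fd, off_t size)
{
	char zeros[4096];
	off_t done = 0;

	memset(zeros, 0, sizeof(zeros));
	while (done < size) {
		size_t chunk = sizeof(zeros);
		ssize_t n;

		if ((off_t)chunk > size - done)
			chunk = (size_t)(size - done);
		n = pwrite(fd, zeros, chunk, done);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		done += n;
	}
	return fsync(fd);
}

/*
 * Overwrite a record inside the preallocated region and flush it with
 * fdatasync().  The file size does not change here, so only the data
 * needs to reach the disk.
 */
int overwrite_record(int fd, off_t off, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return fdatasync(fd);
}
```

A caller would run preallocate_journal() once after creating the file
and overwrite_record() for each later update.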

In addition, Firefox 3.1 is reportedly going to use a union of an
on-disk database and an in-memory database, and every 15 or 30 minutes
or so (presumably tunable via some config parameter), the in-memory
database changes will be synched out to the on-disk database.  This
will *definitely* help a lot, and also help improve SSD endurance.

(Right now Firefox 3.0 writes 2.5 megabytes each time you click on a
URL, not counting the Firefox cache; I have my Firefox cache directory
symlinked to /tmp to save on unnecessary SSD writes, and I was still
recording 2600k written to the filesystem each time I clicked on a
HTML link.  This means that for every 400 pages that I visit, Firefox
is currently generating a full gigabyte of (in my view, unnecessary)
writes to my SSD, all in the name of maintaining Firefox's "Awesome
Bar".  This rather nasty behaviour should hopefully be significantly
improved with Firefox 3.1, or so the Sqlite maintainer tells me.)

					- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:48                                                           ` Theodore Tso
@ 2009-03-27 20:02                                                             ` Aaron Cohen
       [not found]                                                             ` <727e50150903271301l36cff340l33e813bf6f77b4b@mail.gmail.com>
  2009-03-27 22:37                                                             ` Gene Heskett
  2 siblings, 0 replies; 664+ messages in thread
From: Aaron Cohen @ 2009-03-27 20:02 UTC (permalink / raw)
  To: Theodore Tso, Gene Heskett, Linus Torvalds, Alan Cox,
	Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

> And of course, if you want *all* of ext4's benefits, including the
> full factor of 6-8 improvement in fsck times, then you will be best
> served by creating a new ext4 filesystem from scratch and doing a
> backup/reformat/restore pass.
>

Does a newly created ext4 partition have all the various goodies
enabled that I'd want, or do I also need to tune2fs some parameters to
get an "optimal" setup?

-- Aaron

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
       [not found]                                                             ` <727e50150903271301l36cff340l33e813bf6f77b4b@mail.gmail.com>
@ 2009-03-27 20:04                                                               ` Theodore Tso
  0 siblings, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-27 20:04 UTC (permalink / raw)
  To: aaron
  Cc: Gene Heskett, Linus Torvalds, Alan Cox, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 04:01:50PM -0400, Aaron Cohen wrote:
> > And of course, if you want *all* of ext4's benefits, including the
> > full factor of 6-8 improvement in fsck times, then you will be best
> > served by creating a new ext4 filesystem from scratch and doing a
> > backup/reformat/restore pass.
> >
> >
> Does a newly created ext4 partition have all the various goodies enabled that
> I'd want, or do I also need to tune2fs some parameters to get an "optimal"
> setup?

A newly created ext4 partition created with e2fsprogs 1.41.x will have
all of the various goodies enabled.  Note that some of what "goodies"
are enabled are controlled by the mke2fs.conf file, which some
distribution packages treat as a config file, so you need to make sure
it is appropriately updated when you update e2fsprogs.

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:32                                                           ` Theodore Tso
@ 2009-03-27 20:11                                                             ` Andreas T.Auer
  2009-03-27 22:01                                                               ` Linus Torvalds
  2009-03-31  9:58                                                             ` Neil Brown
  1 sibling, 1 reply; 664+ messages in thread
From: Andreas T.Auer @ 2009-03-27 20:11 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Alan Cox, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On 2009-03-27 20:32 Theodore Tso wrote:
> We could potentially make close() imply fbarrier(), but there are
> plenty of times when that might not be such a great idea.  If we do
> that, we're back to requiring synchronous data writes for all files on
> close()
fbarrier() on close() would only mean, that the data shouldn't be
written after the metadata and new metadata shouldn't be written
_before_ old metadata, so you can also delay the committing of the
"dirty" metadata until the real data are written. You don't need
synchronous data writes necessarily.

> The fundamental idea here is not all files need to be forced to disk
> on close.  Not all files need fsync(), or even fbarrier().
An fbarrier() on close() would reflect the thinking of a lot of
developers. You might call them stupid and incompetent, but they surely
are the majority. When closing A before creating B, they don't expect
seeing B without a completed A, even though they might expect that
neither A nor B may be written yet, if the system crashes.
If you have smart developers, you might give them something new, so they
could speed things up with some extra code, e.g. when they create data,
which may be restored by other means, but the default behavior of
automatic fbarrier() on close() would be better.

Andreas

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 18:40                                                       ` Linus Torvalds
  2009-03-27 19:00                                                         ` Alan Cox
@ 2009-03-27 20:27                                                         ` Felipe Contreras
  1 sibling, 0 replies; 664+ messages in thread
From: Felipe Contreras @ 2009-03-27 20:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Matthew Garrett, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 8:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Fri, 27 Mar 2009, Alan Cox wrote:
>>
>> > So the fact is, "people should always use fsync" simply isn't a realistic
>> > expectation, nor is it historically accurate.
>>
>> Far too many people don't - and it is unfortunate but people should learn
>> to write quality software.
>
> You're ignoring reality.
>
> Your definition of "quality software" is PURE SH*T.
>
> Look at that laptop disk spinup issue. Look at the performance issue. Look
> at something as nebulous as "usability".
>
> If adding fsync's makes software unusable (and it does), then you
> shouldn't call that "quality software".
>
> Alan, just please face that reality, and think about it for a moment. If
> fsync() was instantaneous, this discussion wouldn't exist. But read the
> thread. We're talking 3-5s under NORMAL load, with peaks of minutes.

We are looking at the wrong problem, the problem is not "should
userspace apps do fsync", the problem is "how do we ensure reliable
data where it's needed".

It would be great if as a user I could have the option to set an fsync
level and say; look, I have a fast fs, and I really care about data
reliability in this server, so, level=0; or, hmm, what is this data
reliability thing? I just want my phone not to be so damn slow,
level=5.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 14:35                                     ` Christoph Hellwig
  2009-03-27 15:03                                       ` Ric Wheeler
@ 2009-03-27 20:38                                       ` Jeff Garzik
  2009-03-28  0:14                                         ` Alan Cox
  2009-03-29  8:25                                         ` Christoph Hellwig
  1 sibling, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-27 20:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Tso, Jens Axboe, Linus Torvalds, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Christoph Hellwig wrote:
> And please add a tuneable for the flush.  Preferably a generic one at
> the block device layer instead of the current mess where every
> filesystem has a slightly different option for barrier usage.

At the very least, IMO the block layer should be able to notice when 
barriers need not be translated into cache flushes.  Most notably when 
wb cache is disabled on the drive, something easy to auto-detect, but 
probably a manual switch also, for people with enterprise battery-backed 
storage and such.

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH] issue storage dev flush from generic file_fsync helper
  2009-03-26  3:24                                   ` [PATCH v2] issue storage device flush via sync_blockdev() Jeff Garzik
  2009-03-27  2:50                                     ` Theodore Tso
@ 2009-03-27 20:50                                     ` Jeff Garzik
  2009-03-29  8:25                                       ` Christoph Hellwig
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-27 20:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List


Simple and legacy blkdev-based filesystems such as HFS, HFS+, ADFS, AFFS,
FAT, bfs, UFS, NTFS, and qnx4 all use file_fsync as their fsync(2)
VFS helper implementation.

Add a storage dev cache flush, to actually provide the guarantees that
are promised with fsync(2).

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
---

Out of 18 other places that call sync_blockdev(), only 3-4 are in
filesystems that arguably do not need or want a blkdev flush.  This
patch below clearly only addresses 1 out of ~15 callsites that really do
want metadata, data, and everything in between flushed to disk at the
sync_blockdev() callsite.

It should be noted that the other calls are NOT used in fsync(2), but
rather to guarantee that data is written prior to major events such
as unmount, journal close, MD consistency check, etc.


diff --git a/fs/sync.c b/fs/sync.c
index a16d53e..24bb2f4 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -5,6 +5,7 @@
 #include <linux/kernel.h>
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/blkdev.h>
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/writeback.h>
@@ -72,6 +73,13 @@ int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
 	err = sync_blockdev(sb->s_bdev);
 	if (!ret)
 		ret = err;
+
+	err = blkdev_issue_flush(sb->s_bdev, NULL);
+	if (err == -EOPNOTSUPP)
+		err = 0;
+	if (!ret)
+		ret = err;
+
 	return ret;
 }
 

^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 11:24                                   ` Theodore Tso
  2009-03-27 12:17                                     ` Linux 2.6.29 - delayed metadata for delayed allocation? Andreas T.Auer
  2009-03-27 14:51                                     ` Linux 2.6.29 Matthew Garrett
@ 2009-03-27 21:11                                     ` Jeremy Fitzhardinge
  2009-03-28  7:45                                       ` Bojan Smojver
  2 siblings, 1 reply; 664+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-27 21:11 UTC (permalink / raw)
  To: Theodore Tso, Matthew Garrett, Linus Torvalds, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Theodore Tso wrote:
> When I was growing up we were trained to *always* check error returns
> from *all* system calls, and to *always* fsync() if it was critical
> that the data survive a crash.  That was what competent Unix
> programmers did.  And if you are always checking error returns, the
> difference in the Lines of Code between doing it right and doing it
> wrong really wasn't that big --- and again, back then fsync() wasn't
> expensive.  Making fsync expensive was ext3's data=ordered mode's
> fault.

This is a fairly narrow view of correct and possible.  How can you make 
"cat" fsync? grep? sort?  How do they know they're not dealing with 
critical data?  Apps in general don't know, because "criticality" is a 
property of the data itself and how it's used, not the tools operating on it.

My point isn't that "there should be a way of doing fsync from a shell 
script" (which is probably true anyway), but that authors can't 
generally anticipate when their program is going to be dealing with 
something important.  The conservative approach would be to fsync all 
data on every close, but that's almost certainly the wrong thing for 
everyone.

If the filesystem has reasonably strong inherent data-preserving 
properties, then that's much better than scattering fsync everywhere.

fsync obviously makes sense in specific applications; it makes sense to 
fsync when you're guaranteeing that a database commit hits stable 
storage, etc.  But generic tools can't reasonably perform fsyncs, and 
it's not reasonable to say that "important data is always handled by 
special important data tools".
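
For the "database commit" case, a minimal sketch of what such a specific
application might do is shown below. The helper names write_all() and
write_durably() are hypothetical and used only for illustration; the
sketch assumes a POSIX system and checks every error return before
reporting success.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: create or overwrite `path` with `data` and do
 * not report success until fsync() says the data reached stable
 * storage. Returns 0 on success, -1 with errno set on failure.
 */
static int write_durably(const char *path, const char *data)
{
	int fd;

	assert(path != NULL && data != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	if (write_all(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return close(fd);
}
```

A generic tool such as cat or sort has no way to know whether it should
pay this cost, which is the point being made above.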

    J

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 18:35                                   ` Andrew Morton
@ 2009-03-27 21:26                                     ` Jan Kara
  0 siblings, 0 replies; 664+ messages in thread
From: Jan Kara @ 2009-03-27 21:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Linus Torvalds, Theodore Tso, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu 26-03-09 11:35:29, Andrew Morton wrote:
> 
> The patch looks OK to me.
> 
> On Thu, 26 Mar 2009 19:11:06 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > @@ -1490,6 +1494,16 @@ static int ext3_ordered_writepage(struct page *page,
> >  	if (ext3_journal_current_handle())
> >  		goto out_fail;
> >  
> > +	if (!page_has_buffers(page)) {
> > +		create_empty_buffers(page, inode->i_sb->s_blocksize,
> > +				(1 << BH_Dirty)|(1 << BH_Uptodate));
> 
> This will attach dirty buffers to a clean page, which is an invalid
> state (but OK if we immediately fix it up).
  Yes - actually the page was dirty just a moment before, when we ran
clear_page_dirty_for_io() - and this function could also have created
a clean page with dirty buffers...

> > +	} else if (!walk_page_buffers(NULL, page_buffers(page), 0, PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
> > +		/* Provide NULL instead of get_block so that we catch bugs if buffers weren't really mapped */
> > +		return block_write_full_page(page, NULL, wbc);
> > +	}
> > +	page_bufs = page_buffers(page);
> > +
> > +
> >  	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> >  
> >  	if (IS_ERR(handle)) {
> 
> And if this error happens we'll go on to run
> redirty_page_for_writepage() which will do the right thing.
> 
> However if PageMappedToDisk() is working right, we should be able to
> avoid that newly-added buffer walk.  Possibly SetPageMappedToDisk()
> isn't being run in all the right places though, dunno.
  Yes, SetPageMappedToDisk is set only by block_read_full_page(),
mpage_readpage() and nobh_write_begin(). Obviously not enough... It would
be nice to improve that but that's another story...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 22:57                                     ` Andrew Morton
@ 2009-03-27 21:38                                       ` Jan Kara
  2009-03-27 22:10                                         ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Jan Kara @ 2009-03-27 21:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Ingo Molnar, Theodore Tso, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu 26-03-09 15:57:25, Andrew Morton wrote:
> On Thu, 26 Mar 2009 15:39:53 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Thu, 26 Mar 2009, Jan Kara wrote:
> > >
> > >   Reads are measurably better with the patch - the test with cat you
> > > describe below took ~0.5s per file without the patch and always less than
> > > 0.02s with the patch. So it seems to help something.
> > 
> > That would seem to be a _huge_ improvement.
> 
> It's strange that we still don't have an ext3_writepages().  Open a
> transaction, do a large pile of writes, close the transaction again. 
> We don't even have a data=writeback writepages() implementation, which
> should be fairly simple.
  Doable but not fairly simple ;) Firstly you have to restart a transaction
when you've used up all the credits you originally started with (easy),
secondly ext3 uses lock order PageLock -> "transaction start" which is
unusable for the scheme you suggest. So we'd have to revert that - which
needs a larger audit of our locking scheme and that's probably the reason
why no one has done it yet.

> Bizarre.
> 
> Mingming had a shot at it a few years ago and I think Badari did as
> well, but I guess it didn't work out.
> 
> Falling back to generic_writepages() on our main local fs is a bit lame.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:43                                                     ` Jeff Garzik
  2009-03-27 20:01                                                       ` Theodore Tso
@ 2009-03-27 21:46                                                       ` Linus Torvalds
  2009-03-27 22:06                                                         ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 21:46 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Fri, 27 Mar 2009, Jeff Garzik wrote:
> 
> On the other side of the coin, major desktop apps Firefox and Thunderbird
> already use it:  Firefox uses sqlite  [...]

You do know that Firefox had to _disable_ fsync() exactly because not 
disabling it was unacceptable? That whole "why does firefox stop for 5 
seconds" thing created too many bug-reports.

> So, arguments about "people should..." aside, existing desktops apps _do_
> fsync and we get to deal with the bad performance :/

No they don't. Read up on it. Really.

Guys, I don't understand why you even argue. I've been complaining about 
fsync() performance for the last five years or so. It's taken you a long 
time to finally realize, and you still don't seem to "get it".

PEOPLE LITERALLY REMOVE 'fsync()' CALLS BECAUSE THEY ARE UNACCEPTABLE FOR 
USERS.

It really is that simple.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 20:11                                                             ` Andreas T.Auer
@ 2009-03-27 22:01                                                               ` Linus Torvalds
  0 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 22:01 UTC (permalink / raw)
  To: Andreas T.Auer
  Cc: Theodore Tso, Alan Cox, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Fri, 27 Mar 2009, Andreas T.Auer wrote:
> 
> > The fundamental idea here is not all files need to be forced to disk
> > on close.  Not all files need fsync(), or even fbarrier().
>
> An fbarrier() on close() would reflect the thinking of a lot of
> developers.

It also happens to be what pretty much all network filesystems end up 
implementing.

That said, there's a reason many people prefer local filesystems to even 
high-performance NFS - latency (especially for metadata which even modern 
versions of NFS cannot cache effectively) just sucks when you have to go 
over the network. It pretty much doesn't matter _how_ fast your network or 
server is.

One thing that might make sense is to make "close()" start background 
writeout for that file (modulo issues like laptop mode) with low priority. 

No, it obviously doesn't guarantee any kind of filesystem coherency, but 
it _does_ mean that the window for the bad cases goes from potentially 30 
seconds down to fractions of seconds. That's likely quite a bit of 
improvement in practice.

IOW, no "hard barriers", but simply more of a "even in the absence of 
fsync we simply aim for the user to have to be _really_ unlucky to ever 
hit any bad cases".
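
A rough userspace approximation of the "start background writeout on
close" idea, as a sketch only: it assumes Linux and uses
sync_file_range(2) with SYNC_FILE_RANGE_WRITE, which asks the kernel to
begin writing dirty pages without waiting for them. The helper name
close_with_writeout_hint() is hypothetical, the call says nothing about
I/O priority, and like the proposal above it gives no durability or
coherency guarantee.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* needed for sync_file_range() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start writing the file's
 * dirty pages, without waiting for completion, then close the
 * descriptor. The writeout request is only a best-effort hint, so
 * its result is deliberately ignored.
 */
static int close_with_writeout_hint(int fd)
{
	assert(fd >= 0);

	/* offset 0 with nbytes 0 means "from the start to end of file" */
	(void)sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

	return close(fd);
}
```

This only shrinks the window in which a crash loses data; anything that
must survive a crash still needs fsync().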

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 21:46                                                       ` Linus Torvalds
@ 2009-03-27 22:06                                                         ` Jeff Garzik
  2009-03-27 22:19                                                           ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-27 22:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Fri, 27 Mar 2009, Jeff Garzik wrote:
>> On the other side of the coin, major desktop apps Firefox and Thunderbird
>> already use it:  Firefox uses sqlite  [...]
> 
> You do know that Firefox had to _disable_ fsync() exactly because not 
> disabling it was unacceptable? That whole "why does firefox stop for 5 
> seconds" thing created too many bug-reports.
> 
>> So, arguments about "people should..." aside, existing desktops apps _do_
>> fsync and we get to deal with the bad performance :/
> 
> No they don't. Read up on it. Really.

What is in Fedora 10 and Debian lenny's iceweasel both definitely sync 
to disk, as of today, according to my own tests.

I'm talking about what's in real world user's hands today, not some 
hoped-for future version in developer CVS somewhere, depending on build 
options and who knows what else...

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-27 21:38                                       ` Jan Kara
@ 2009-03-27 22:10                                         ` Linus Torvalds
  2009-03-28 19:43                                           ` Andrew Morton
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 22:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Ingo Molnar, Theodore Tso, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath



On Fri, 27 Mar 2009, Jan Kara wrote:
>
>   Doable but not fairly simple ;) Firstly you have to restart a transaction
> when you've used up all the credits you originally started with (easy),
> secondly ext3 uses lock order PageLock -> "transaction start" which is
> unusable for the scheme you suggest. So we'd have to revert that - which
> needs a larger audit of our locking scheme and that's probably the reason
> why no one has done it yet.

It's also not clear that ext3 can really do much better than the regular 
generic_writepages() logic. I mean, seriously, what's there to improve on? 
The transaction code is all normally totally pointless, and I merged the 
patch that avoids it when not necessary. 

It might be different if more people used "data=journal", but I doubt 
that is very common. For data=writeback and data=ordered, I bet 
generic_writepages() is as good as anything ext3-specific could be.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 22:06                                                         ` Jeff Garzik
@ 2009-03-27 22:19                                                           ` Linus Torvalds
  2009-03-27 22:25                                                             ` Linus Torvalds
                                                                               ` (2 more replies)
  0 siblings, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 22:19 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Fri, 27 Mar 2009, Jeff Garzik wrote:
>
> What is in Fedora 10 and Debian lenny's iceweasel both definitely sync to
> disk, as of today, according to my own tests.

Hmm. Go to "about:config" and check your "toolkit.storage.synchronous" 
setting.

It _should_ say

	default integer 0

and that is what it says for me (yes, on Fedora 10).

The values are: 0 = off, 1 = normal, 2 = full.

If you don't have that "toolkit.storage.synchronous" entry, that means 
that you have an older version of firefox-3. And if you have some other 
value, it either means somebody changed it, or that Fedora is shipping 
with multiple different versions (the "official" Firefox source code 
defaults to 1, I think, but they suggested distributions change the 
default to 0).

		Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 20:01                                                       ` Theodore Tso
@ 2009-03-27 22:20                                                         ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-27 22:20 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Linus Torvalds, Matthew Garrett,
	Alan Cox, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Theodore Tso wrote:
> On Fri, Mar 27, 2009 at 03:43:03PM -0400, Jeff Garzik wrote:
>> On the other side of the coin, major desktop apps Firefox and  
>> Thunderbird already use it:  Firefox uses sqlite to log open web pages  
>> in case of a crash, and sqlite in turn sync's its journal as any good  
>> database app should.  [I think tytso just got them to use fdatasync and  
>> a couple other improvements, to make this not-quite-so-bad]
> 
> I spent a very productive hour-long conversation with the Sqlite
> maintainer last weekend.  He's already checked in a change to use
> fdatasync() everywhere, and he's looking into other changes that would
> help avoid needing to do a metadata sync because i_size has changed.
> One thing that will definitely help is if applications send the
> sqlite-specific SQL command "PRAGMA journal_mode = PERSIST;" when they
> first start up the Sqlite database connection.  This will cause Sqlite
> to keep the rollback journal file around instead of being
> deleted and then recreated for each Sqlite transaction.  This avoids
> at least one fsync() of the directory containing the rollback journal
> file.  Combined with the change in Sqlite's development branch to use
> fdatasync() everywhere that fsync() is used, this should definitely be
> a huge improvement.
> 
> In addition, Firefox 3.1 is reportedly going to use a union of an
> on-disk database and an in-memory database, and every 15 or 30 minutes
> or so (presumably tunable via some config parameter), the in-memory
> database changes will be synched out to the on-disk database.  This
> will *definitely* help a lot, and also help improve SSD endurance.

Definitely, though it will be an interesting balance once user feedback 
starts to roll in...

Firefox started doing this stuff because, when it or the window system 
or OS crashed, users like my wife would not lose the 50+ tabs they've 
opened and were actively using.  :)

So it's hard to see how users will react to going back to the days when 
firefox crashes once again mean lost work.  [referring to the 15-30 min 
delay, not fsync(2)]

	Jeff





^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 22:19                                                           ` Linus Torvalds
@ 2009-03-27 22:25                                                             ` Linus Torvalds
  2009-03-28  1:19                                                               ` Jeff Garzik
  2009-03-28  0:18                                                             ` Jeff Garzik
  2009-03-28  2:16                                                             ` Mark Lord
  2 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-27 22:25 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Fri, 27 Mar 2009, Linus Torvalds wrote:
> 
> The values are: 0 = off, 1 = normal, 2 = full.

Of course, I don't actually know that "off" really means "never fsync". It 
may be that it only cuts down on the number of fsync's. I do know that 
firefox with the original defaults ("fsync everywhere") was totally 
unusable, and that got fixed.

But maybe it got fixed to "only pauses occasionally" rather than "every 
single page load brings everything to a screeching halt".

Of course, your browsing history database is an excellent example of 
something you should _not_ care about that much, and where performance is 
a lot more important than "ooh, if the machine goes down suddenly, I need 
to be 100% up-to-date". Using fsync on that thing was just stupid, even 
regardless of any ext3 issues.

		Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:48                                                           ` Theodore Tso
  2009-03-27 20:02                                                             ` Aaron Cohen
       [not found]                                                             ` <727e50150903271301l36cff340l33e813bf6f77b4b@mail.gmail.com>
@ 2009-03-27 22:37                                                             ` Gene Heskett
  2009-03-27 22:55                                                               ` Theodore Tso
  2 siblings, 1 reply; 664+ messages in thread
From: Gene Heskett @ 2009-03-27 22:37 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Alan Cox, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Friday 27 March 2009, Theodore Tso wrote:
>On Fri, Mar 27, 2009 at 03:19:10PM -0400, Gene Heskett wrote:
>> >You've said for a long time that ext3 is really bad in that it
>> >inflicts this --- I agree with you.  People should use other
>> >filesystems which are better.  This includes ext4, which is completely
>> >format compatible with ext3.  They don't even have to switch on
>> >extents support to get better behaviour.  Just mounting an ext3
>> >filesystem with ext4 will result in better behaviour.
>>
>> Ohkay.  But in a 'make xconfig' of 2.6.28.9, how much of ext4 can be
>> turned on without rendering the old ext3 fstab defaults incompatible
>> should I be forced to boot a kernel with no ext4 support?
>
>Ext4 doesn't make any non-backwards compatible changes to the
>filesystem.  So if you just take an ext3 filesystem, and mount it as
>ext4, it will work just fine; you will get delayed allocation, you
>will get a slightly boosted write priority for kjournald, and then
>when you unmount it, that filesystem will work *just* *fine* on a
>kernel with no ext4 support.  You can mount it as an ext3 filesystem.
>
>If you use tune2fs to enable various ext4 features, such as extents,
>etc., then when you mount the filesystem as ext4, you will get the
>benefit of extents for any new files which are created, and once you
>do that, the filesystem can't be mounted on an ext3-only system, since
>ext3 doesn't know how to deal with extents.
>
>And of course, if you want *all* of ext4's benefits, including the
>full factor of 6-8 improvement in fsck times, then you will be best
>served by creating a new ext4 filesystem from scratch and doing a
>backup/reformat/restore pass.
>
>But if you're just annoyed by the large latencies in Ingo's "make
>-j32" example, simply taking the ext3 filesystem and mounting it as
>ext4 should make those problems go away.  And it won't make any
>incompatible changes to the filesystem.  (This didn't use to be true
>in the pre-2.6.26 days, but I insisted on getting this fixed so people
>could always mount an ext2 or ext3 filesystems using ext4 without the
>kernel making any irreversible filesystem format changes behind the
>user's back.)
>
>					- Ted

Thanks Ted, I will build 2.6.28.9 with this:
[root@coyote linux-2.6.28.9]# grep EXT .config
[...]
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_EXT2_FS=m
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
# CONFIG_EXT2_FS_SECURITY is not set
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=m
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
# CONFIG_EXT4DEV_COMPAT is not set
# CONFIG_EXT4_FS_XATTR is not set
CONFIG_GENERIC_FIND_NEXT_BIT=y

Anything there that isn't compatible?

I'll build that, but only switch the /amandatapes mount in fstab for testing 
tonight unless you spot something above.

Thanks.


-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Losing your drivers' license is just God's way of saying "BOOGA, BOOGA!"


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 22:37                                                             ` Gene Heskett
@ 2009-03-27 22:55                                                               ` Theodore Tso
  2009-03-28  0:42                                                                 ` Gene Heskett
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-27 22:55 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Linus Torvalds, Alan Cox, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 06:37:08PM -0400, Gene Heskett wrote:
> Thanks Ted, I will build 2.6.28.9 with this:
> [root@coyote linux-2.6.28.9]# grep EXT .config
> [...]
> CONFIG_PAGEFLAGS_EXTENDED=y
> CONFIG_EXT2_FS=m
> CONFIG_EXT2_FS_XATTR=y
> CONFIG_EXT2_FS_POSIX_ACL=y
> # CONFIG_EXT2_FS_SECURITY is not set
> CONFIG_EXT2_FS_XIP=y
> CONFIG_EXT3_FS=m
> CONFIG_EXT3_FS_XATTR=y
> CONFIG_EXT3_FS_POSIX_ACL=y
> CONFIG_EXT3_FS_SECURITY=y
> CONFIG_EXT4_FS=y
> # CONFIG_EXT4DEV_COMPAT is not set
> # CONFIG_EXT4_FS_XATTR is not set
> CONFIG_GENERIC_FIND_NEXT_BIT=y
> 
> Anything there that isn't compatible?

Well, if you need extended attributes (if you are using SELinux, then
you need extended attributes) you'll want to enable
CONFIG_EXT4_FS_XATTR.

If you want to use ext4 on your root filesystem, you may need to take
some special measures depending on your distribution.  Using the boot
command-line option rootfstype=ext4 will work on many distributions,
but I haven't tested all of them.  It definitely works on Ubuntu, and
it should work if you're not using an initial ramdisk.

Oh yeah; the other thing I should warn you about is that 2.6.28.9
won't have the replace-via-rename and replace-via-truncate
workarounds.  So if you crash and your applications aren't using
fsync(), you could end up seeing the zero-length files.  I very much
doubt that will make a big difference for your /amandatapes partition,
but if you want to use this for the filesystem where you have your home
directory, you'll probably want the workaround patches.  I've heard
reports of KDE users telling me that when they initially start up their
desktop, literally hundreds of files are rewritten by their desktop,
just starting it up.  (Why?  Who knows?  It's not good for SSD
endurance, in any case.)  But if you crash while initially logging in,
your KDE configuration files might get wiped out w/o the
replace-via-rename and replace-via-truncate workaround patches.
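
For reference, a minimal sketch of how an application can do
replace-via-rename safely without relying on those workarounds. The
helper name replace_file() and the ".tmp" suffix are hypothetical
choices made for illustration; the essential step is the fsync() of the
new file before rename().

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: atomically replace `path` with `data`.
 * The new contents go to a temporary file, are fsync()ed, and only
 * then renamed over the old file, so a crash leaves either the old
 * or the new contents rather than a zero-length file.
 * A short write is treated as a failure to keep the sketch small.
 */
static int replace_file(const char *path, const char *data)
{
	char tmp[4096];
	size_t len;
	int fd;

	assert(path != NULL && data != NULL);
	len = strlen(data);

	if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp))
		return -1;

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) < 0 || rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

The sketch does not fsync() the containing directory, so after a crash
the rename itself may not have happened yet; what it rules out is the
new name pointing at an empty file.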

> I'll build that, but only switch the /amandatapes mount in fstab for testing 
> tonight unless you spot something above.

OK, so you're not worried about your root filesystem, and presumably
the issue with your home directory won't be an issue for you either.
The only question then is whether you need extended attribute support.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 20:38                                       ` Jeff Garzik
@ 2009-03-28  0:14                                         ` Alan Cox
  2009-03-29  8:25                                         ` Christoph Hellwig
  1 sibling, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-28  0:14 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Theodore Tso, Jens Axboe, Linus Torvalds,
	Ingo Molnar, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Fri, 27 Mar 2009 16:38:35 -0400
Jeff Garzik <jeff@garzik.org> wrote:

> Christoph Hellwig wrote:
> > And please add a tuneable for the flush.  Preferably a generic one at
> > the block device layer instead of the current mess where every
> > filesystem has a slightly different option for barrier usage.
> 
> At the very least, IMO the block layer should be able to notice when 
> barriers need not be translated into cache flushes.  Most notably when 
> wb cache is disabled on the drive, something easy to auto-detect, but 
> probably a manual switch also, for people with enterprise battery-backed 
> storage and such.

The storage drivers for those cases already generally know this and treat
cache flush requests as "has hit nvram", even the non enterprise ones.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 22:19                                                           ` Linus Torvalds
  2009-03-27 22:25                                                             ` Linus Torvalds
@ 2009-03-28  0:18                                                             ` Jeff Garzik
  2009-03-28  1:45                                                               ` Linus Torvalds
  2009-03-28  2:16                                                             ` Mark Lord
  2 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-28  0:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Fri, 27 Mar 2009, Jeff Garzik wrote:
>> What is in Fedora 10 and Debian lenny's iceweasel both definitely sync to
>> disk, as of today, according to my own tests.
> 
> Hmm. Go to "about:config" and check your "toolkit.storage.synchronous" 
> setting.
> 
> It _should_ say
> 
> 	default integer 0
> 
> and that is what it says for me (yes, on Fedora 10).
> 
> The values are: 0 = off, 1 = normal, 2 = full.

Definitely a difference!   1 for both, here.  Deb is a fresh OS install 
and fresh homedir, but my F10 has been through many OS and ff config 
upgrades over the years.

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 22:55                                                               ` Theodore Tso
@ 2009-03-28  0:42                                                                 ` Gene Heskett
  0 siblings, 0 replies; 664+ messages in thread
From: Gene Heskett @ 2009-03-28  0:42 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Alan Cox, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Friday 27 March 2009, Theodore Tso wrote:
>On Fri, Mar 27, 2009 at 06:37:08PM -0400, Gene Heskett wrote:
>> Thanks Ted, I will build 2.6.28.9 with this:
>> [root@coyote linux-2.6.28.9]# grep EXT .config
>> [...]
>> CONFIG_PAGEFLAGS_EXTENDED=y
>> CONFIG_EXT2_FS=m
>> CONFIG_EXT2_FS_XATTR=y
>> CONFIG_EXT2_FS_POSIX_ACL=y
>> # CONFIG_EXT2_FS_SECURITY is not set
>> CONFIG_EXT2_FS_XIP=y
>> CONFIG_EXT3_FS=m
>> CONFIG_EXT3_FS_XATTR=y
>> CONFIG_EXT3_FS_POSIX_ACL=y
>> CONFIG_EXT3_FS_SECURITY=y
>> CONFIG_EXT4_FS=y
>> # CONFIG_EXT4DEV_COMPAT is not set
>> # CONFIG_EXT4_FS_XATTR is not set
>> CONFIG_GENERIC_FIND_NEXT_BIT=y
>>
>> Anything there that isn't compatible?
>
>Well, if you need extended attributes (if you are using SELinux, then
>you need extended attributes) you'll want to enable
>CONFIG_EXT4_FS_XATTR.
>
>If you want to use ext4 on your root filesystem, you may need to take
>some special measures depending on your distribution.  Using the boot
>command-line option rootfstype=ext4 will work on many distributions,
>but I haven't tested all of them.  It definitely works on Ubuntu, and
>it should work if you're not using an initial ramdisk.
>
>Oh yeah; the other thing I should warn you about is that 2.6.28.9
>won't have the replace-via-rename and replace-via-truncate
>workarounds.  So if you crash and your applications aren't using
>fsync(), you could end up seeing the zero-length files.  I very much
>doubt that will make a big difference for your /amandatapes partition,
>but if you want to use this for the filesystem where you have your home
>directory, you'll probably want the workaround patches.  I've heard
>reports of KDE users telling me that when they initially start up their
>desktop, literally hundreds of files are rewritten by their desktop,
>just starting it up.  (Why?  Who knows?  It's not good for SSD
>endurance, in any case.)  But if you crash while initially logging in,
>your KDE configuration files might get wiped out w/o the
>replace-via-rename and replace-via-truncate workaround patches.
>
>> I'll build that, but only switch the /amandatapes mount in fstab for
>> testing tonight unless you spot something above.
>
>OK, so you're not worried about your root filesystem, and presumably
>the issue with your home directory won't be an issue for you either.
>The only question then is whether you need extended attribute support.
>
>Regards,
>
>					- Ted

Thanks Ted, it's building w/o the extra CONFIG_EXT4_FS_XATTR atm, but I'll 
enable that and do it again before I reboot.  I had just fired off the build 
when I saw your answer. NBD, my 'makeit' script is pretty complete.

Thank you.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
The only rose without thorns is friendship.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 22:25                                                             ` Linus Torvalds
@ 2009-03-28  1:19                                                               ` Jeff Garzik
  2009-03-28  1:30                                                                 ` David Miller
                                                                                   ` (3 more replies)
  0 siblings, 4 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-28  1:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> Of course, your browsing history database is an excellent example of 
> something you should _not_ care about that much, and where performance is 
> a lot more important than "ooh, if the machine goes down suddenly, I need 
> to be 100% up-to-date". Using fsync on that thing was just stupid, even 

If you are doing a ton of web-based work with a bunch of tabs or windows 
open, you really like the post-crash restoration methods that Firefox 
now employs.  Some users actually do want to checkpoint/restore their 
web work, regardless of whether it was the browser, the window system or 
the OS that crashed.

You may not care about that, but others do care about the integrity of 
the database that stores the active FF state (Web URLs currently open), 
a database which necessarily changes for each URL visited.



As an aside, I find it highly ironic that Firefox gained useful session 
management around the same time that some GNOME jarhead no-op'd GNOME 
session management[1] in X.

	Jeff



[1] http://np237.livejournal.com/22014.html


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  1:19                                                               ` Jeff Garzik
@ 2009-03-28  1:30                                                                 ` David Miller
  2009-03-28  2:19                                                                 ` Mark Lord
                                                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 664+ messages in thread
From: David Miller @ 2009-03-28  1:30 UTC (permalink / raw)
  To: jeff; +Cc: torvalds, mjg59, alan, tytso, akpm, drees76, jesper, linux-kernel

From: Jeff Garzik <jeff@garzik.org>
Date: Fri, 27 Mar 2009 21:19:12 -0400

> As an aside, I find it highly ironic that Firefox gained useful
> session management around the same time that some GNOME jarhead
> no-op'd GNOME session management[1] in X.

Great, now all the KDE boo-birds might have to switch back,
or even go to xfce4.

If KDE and GNOME both make a bad release at the same time,
then we'll really be in trouble. :-)


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  0:18                                                             ` Jeff Garzik
@ 2009-03-28  1:45                                                               ` Linus Torvalds
  2009-03-28  2:53                                                                 ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-28  1:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Fri, 27 Mar 2009, Jeff Garzik wrote:
> 
> Definitely a difference!   1 for both, here.  Deb is a fresh OS install and
> fresh homedir, but my F10 has been through many OS and ff config upgrades over
> the years.

Hmm. I wonder where firefox gets its defaults then. 

I can well imagine that Debian has a different firefox build, with 
different defaults. But if your F10 thing also is set to 1, and still 
shows as "default", then that's odd, considering that mine shows 0.

I have 'rpm -q firefox': firefox-3.0.7-1.fc10.x86_64.

Is yours a 32-bit one? Maybe it comes with different defaults?

And maybe firefox just has a very odd config setup and I don't understand 
what "default" means at all. Gene says he doesn't have that 
toolkit.storage.synchronous thing at all.

		Linus


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 22:19                                                           ` Linus Torvalds
  2009-03-27 22:25                                                             ` Linus Torvalds
  2009-03-28  0:18                                                             ` Jeff Garzik
@ 2009-03-28  2:16                                                             ` Mark Lord
  2009-03-28  2:38                                                               ` Linus Torvalds
  2 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-28  2:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Fri, 27 Mar 2009, Jeff Garzik wrote:
>> What is in Fedora 10 and Debian lenny's iceweasel both definitely sync to
>> disk, as of today, according to my own tests.
> 
> Hmm. Go to "about:config" and check your "toolkit.storage.synchronous" 
> setting.
..
> If you don't have that "toolkit.storage.synchronous" entry, that means 
> that you have an older version of firefox-3.
..


Okay, I'll bite.  Exactly which version of FF has that variable?
Cuz it ain't in the FF 3.0.8 that I'm running here.

Thanks

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  1:19                                                               ` Jeff Garzik
  2009-03-28  1:30                                                                 ` David Miller
@ 2009-03-28  2:19                                                                 ` Mark Lord
  2009-03-28  2:49                                                                   ` Jeff Garzik
  2009-03-29  0:33                                                                 ` david
  2009-03-31 15:01                                                                 ` Thierry Vignaud
  3 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-28  2:19 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Jeff Garzik wrote:
> Linus Torvalds wrote:
>> Of course, your browsing history database is an excellent example of 
>> something you should _not_ care about that much, and where performance 
>> is a lot more important than "ooh, if the machine goes down suddenly, 
>> I need to be 100% up-to-date". Using fsync on that thing was just 
>> stupid, even 
> 
> If you are doing a ton of web-based work with a bunch of tabs or windows 
> open, you really like the post-crash restoration methods that Firefox 
> now employs.  Some users actually do want to checkpoint/restore their 
> web work, regardless of whether it was the browser, the window system or 
> the OS that crashed.
> 
> You may not care about that, but others do care about the integrity of 
> the database that stores the active FF state (Web URLs currently open), 
> a database which necessarily changes for each URL visited.
..

fsync() isn't going to affect that one way or another
unless the entire kernel freezes and dies.

Firefox locks up the GUI here from time to time,
but the kernel still flushes pages to disk,
and even more quickly when alt-sysrq-s is used.

Cheers

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  2:16                                                             ` Mark Lord
@ 2009-03-28  2:38                                                               ` Linus Torvalds
  2009-03-28 11:57                                                                 ` Andreas T.Auer
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-28  2:38 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jeff Garzik, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On Fri, 27 Mar 2009, Mark Lord wrote:
>
> Okay, I'll bite.  Exactly which version of FF has that variable?
> Cuz it ain't in the FF 3.0.8 that I'm running here.

I _thought_ it was there since rc2 of FF-3, but clearly there are odd 
things afoot. You're the second person to report it not there.

I'd suspect that I mistyped it, but I just cut-and-pasted it from my email 
to make sure. Maybe you did. What happens if you just write "sync" in the 
Filter: box? Nothing matches?

Do you see firefox pausing a lot under disk load? If you just add that 
"toolkit.storage.synchronous" value by hand (right-click in the preference 
window, do "New" -> "Integer"), and write it in as zero, does it change 
behavior?

		Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  2:19                                                                 ` Mark Lord
@ 2009-03-28  2:49                                                                   ` Jeff Garzik
  2009-03-28 13:29                                                                     ` Stefan Richter
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-28  2:49 UTC (permalink / raw)
  To: Mark Lord
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Mark Lord wrote:
> fsync() isn't going to affect that one way or another
> unless the entire kernel freezes and dies.

  ...which is one of the three common crash scenarios listed (and 
experienced in the field).

	Jeff





^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  1:45                                                               ` Linus Torvalds
@ 2009-03-28  2:53                                                                 ` Jeff Garzik
  2009-03-28  2:56                                                                   ` Zid Null
  2009-03-28  3:55                                                                   ` Gene Heskett
  0 siblings, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-28  2:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Fri, 27 Mar 2009, Jeff Garzik wrote:
>> Definitely a difference!   1 for both, here.  Deb is a fresh OS install and
>> fresh homedir, but my F10 has been through many OS and ff config upgrades over
>> the years.
> 
> Hmm. I wonder where firefox gets its defaults then. 
> 
> I can well imagine that Debian has a different firefox build, with 
> different defaults. But if your F10 thing also is set to 1, and still 
> shows as "default", then that's odd, considering that mine shows 0.
> 
> I have 'rpm -q firefox': firefox-3.0.7-1.fc10.x86_64.
> 
> Is yours a 32-bit one? Maybe it comes with different defaults?
> 
> And maybe firefox just has a very odd config setup and I don't understand 
> what "default" means at all. Gene says he doesn't have that 
> toolkit.storage.synchronous thing at all.

In my case the toolkit.storage.synchronous is present in both, set to 1 
in Deb and bolded and set to 1 in F10 (firefox-3.0.7-1.fc10.x86_64).

The latter's bold typeface makes me think my F10 FF 
toolkit.storage.synchronous setting is NOT set to the F10 default -- 
although I have never heard of this setting, and have certainly not 
manually tweaked it.  The only FF setting I manually tweak is cache 
directory.

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  2:53                                                                 ` Jeff Garzik
@ 2009-03-28  2:56                                                                   ` Zid Null
  2009-03-28  3:55                                                                   ` Gene Heskett
  1 sibling, 0 replies; 664+ messages in thread
From: Zid Null @ 2009-03-28  2:56 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

2009/3/28 Jeff Garzik <jeff@garzik.org>:
> Linus Torvalds wrote:
>>
>> On Fri, 27 Mar 2009, Jeff Garzik wrote:
>>>
>>> Definitely a difference!   1 for both, here.  Deb is a fresh OS install
>>> and
>>> fresh homedir, but my F10 has been through many OS and ff config upgrades
>>> over
>>> the years.
>>
>> Hmm. I wonder where firefox gets its defaults then.
>> I can well imagine that Debian has a different firefox build, with
>> different defaults. But if your F10 thing also is set to 1, and still shows
>> as "default", then that's odd, considering that mine shows 0.
>>
>> I have 'rpm -q firefox': firefox-3.0.7-1.fc10.x86_64.
>>
>> Is yours a 32-bit one? Maybe it comes with different defaults?
>>
>> And maybe firefox just has a very odd config setup and I don't understand
>> what "default" means at all. Gene says he doesn't have that
>> toolkit.storage.synchronous thing at all.
>
> In my case the toolkit.storage.synchronous is present in both, set to 1 in
> Deb and bolded and set to 1 in F10 (firefox-3.0.7-1.fc10.x86_64).

I compiled my own firefox under gentoo, not present.
Mozilla Firefox 3.0.7, Copyright (c) 1998 - 2009 mozilla.org

> The latter's bold typeface makes me think my F10 FF
> toolkit.storage.synchronous setting is NOT set to the F10 default --
> although I have never heard of this setting, and have certainly not manually
> tweaked it.  The only FF setting I manually tweak is cache directory.
>
>        Jeff
>
>

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  2:53                                                                 ` Jeff Garzik
  2009-03-28  2:56                                                                   ` Zid Null
@ 2009-03-28  3:55                                                                   ` Gene Heskett
  2009-03-28 11:29                                                                     ` Alejandro Riveira Fernández
  1 sibling, 1 reply; 664+ messages in thread
From: Gene Heskett @ 2009-03-28  3:55 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Friday 27 March 2009, Jeff Garzik wrote:
>Linus Torvalds wrote:
>> On Fri, 27 Mar 2009, Jeff Garzik wrote:
>>> Definitely a difference!   1 for both, here.  Deb is a fresh OS install
>>> and fresh homedir, but my F10 has been through many OS and ff config
>>> upgrades over the years.
>>
>> Hmm. I wonder where firefox gets its defaults then.
>>
>> I can well imagine that Debian has a different firefox build, with
>> different defaults. But if your F10 thing also is set to 1, and still
>> shows as "default", then that's odd, considering that mine shows 0.
>>
>> I have 'rpm -q firefox': firefox-3.0.7-1.fc10.x86_64.
>>
>> Is yours a 32-bit one? Maybe it comes with different defaults?
>>
>> And maybe firefox just has a very odd config setup and I don't understand
>> what "default" means at all. Gene says he doesn't have that
>> toolkit.storage.synchronous thing at all.
>
>In my case the toolkit.storage.synchronous is present in both, set to 1
>in Deb and bolded and set to 1 in F10 (firefox-3.0.7-1.fc10.x86_64).
>
>The latter's bold typeface makes me think my F10 FF
>toolkit.storage.synchronous setting is NOT set to the F10 default --
>although I have never heard of this setting, and have certainly not
>manually tweaked it.  The only FF setting I manually tweak is cache
>directory.
>
>	Jeff

I just let FF update itself to 3.0.8 (from mozilla, not fedora) and there is 
no 'toolkit' stuff whatsoever in about:config.

Is this perchance some extension I don't have installed?
>


-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Jacquin's Postulate on Democratic Government:
	No man's life, liberty, or property are safe while the
	legislature is in session.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  1:25                         ` Andrew Morton
                                             ` (2 preceding siblings ...)
  2009-03-27  3:38                           ` Linus Torvalds
@ 2009-03-28  5:06                           ` Ingo Molnar
  2009-04-01 21:03                           ` Lennart Sorensen
  4 siblings, 0 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-03-28  5:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Theodore Tso, David Rees, Jesper Krogh,
	Linux Kernel Mailing List


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 26 Mar 2009 18:03:15 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Thu, 26 Mar 2009, Andrew Morton wrote:
> > > 
> > > userspace can get closer than the kernel can.
> > 
> > Andrew, that's SIMPLY NOT TRUE.
> > 
> > You state that without any amount of data to back it up, as if it was some 
> > kind of truism. It's not.
> 
> I've seen you repeatedly fiddle the in-kernel defaults based on 
> in-field experience.  That could just as easily have been done in 
> initscripts by distros, and much more effectively because it 
> doesn't need a new kernel.  That's data.
> 
> The fact that this hasn't even been _attempted_ (afaik) is 
> deplorable.
> 
> Why does everyone just sit around waiting for the kernel to put a 
> new value into two magic numbers which userspace scripts could 
> have set?

Three reasons.

Firstly, this utterly does not scale.

Microsoft has built an empire on the 'power of the default settings' 
- why cannot Linux kernel developers finally realize the obvious: 
that setting defaults centrally is an incredibly efficient way of 
shaping the end result?

The second reason is that in the past 10 years we have gone through 
a couple of toxic cycles of distros trying to work around kernel 
behavior by setting sysctls. That was done and then forgotten, and a 
few years down the line some kernel maintainer found [related to a 
bugreport] that distro X set that sysctl to value Y which now had a 
different behavior and immediately chastised the distro as broken and 
refused to touch the bugreport and refused bugreports from that 
distro from that point on.

We've seen this again, and again, and i remember 2-3 specific 
examples and i know how badly this experience trickles down on the 
distro side.

The end result: pretty much any tuning of kernel defaults is done 
extremely reluctantly by distros. They consider kernel behavior a 
domain of the kernel, and they don't generally risk going away from 
the default. [ In other words, distro developers understand the 
'power of defaults' a lot better than kernel developers ... ]

This is also true in the reverse direction: they don't actually mind 
the kernel doing a central change of policy, if it's a general step 
forward. Distro developers are very practical, and they are a lot 
less hardline about the sacred Unix principle of separation of 
kernel from policy.

Thirdly: the latency of getting changes to users. A new kernel is 
released every 3 months. Distros are released every 6 months. A new 
Firefox major version is released about once a year. A new major GCC 
is released every three years.

Given the release frequency and given our goal to minimize the 
latency of getting improvements to users, which of these projects is 
best suited to introduce a new default value? [and no, such changes 
are not generally done in minor package updates.]

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 21:11                                     ` Jeremy Fitzhardinge
@ 2009-03-28  7:45                                       ` Bojan Smojver
  2009-03-28  8:43                                         ` Bojan Smojver
  0 siblings, 1 reply; 664+ messages in thread
From: Bojan Smojver @ 2009-03-28  7:45 UTC (permalink / raw)
  To: linux-kernel

Jeremy Fitzhardinge <jeremy <at> goop.org> writes:

> This is a fairly narrow view of correct and possible.  How can you make 
> "cat" fsync? grep? sort?  How do they know they're not dealing with 
> critical data?  Apps in general don't know, because "criticality" is a 
> property of the data itself and how it's used, not the tools operating on it.

Isn't it possible to compile a program that simply calls open()/fsync()/close()
on a given file name? If yes, then in your scripts, you can do whatever you want
with existing tools on a _scratch_ file, then call your fsync program on that
scratch file and then rename it to the real file. No?

In other words, given that you know that your data is critical, you will write
processed data to another file, while preserving the original, store the new
file safely and then rename it to the original. Just like the apps that know
that their files are critical are supposed to do using the API.
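
For illustration, the scratch-file sequence described above (write the new
contents elsewhere, fsync, then rename over the original) might look like
the following minimal sketch in Python. The helper name replace_durably and
the ".scratch-" prefix are hypothetical, and this only shows the ordering of
the calls; it does not claim anything about what a given filesystem
guarantees.

```python
import os
import tempfile


def replace_durably(path: str, data: bytes) -> None:
    """Write data to a scratch file, fsync it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the scratch file in the same directory so the rename stays
    # within one filesystem.
    fd, scratch = tempfile.mkstemp(dir=directory, prefix=".scratch-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer to the kernel
            os.fsync(f.fileno())   # ask the kernel to push the data to disk
        os.replace(scratch, path)  # rename the scratch file over the original
    except BaseException:
        # On failure the original file is left alone; drop the scratch file.
        if os.path.exists(scratch):
            os.unlink(scratch)
        raise
```

The original file is only ever replaced by a scratch file whose contents
have already been through fsync(), so a reader should see either the old
version or the complete new one.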

--
Bojan




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 16:02       ` Linus Torvalds
@ 2009-03-28  7:50         ` Mike Galbraith
  2009-03-30 22:00         ` Hans-Peter Jansen
  1 sibling, 0 replies; 664+ messages in thread
From: Mike Galbraith @ 2009-03-28  7:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Geert Uytterhoeven, Hans-Peter Jansen, linux-kernel

On Fri, 2009-03-27 at 09:02 -0700, Linus Torvalds wrote:
> 
> On Fri, 27 Mar 2009, Mike Galbraith wrote:
> > > 
> > > If you're using the kernel-of-the-day, you're probably using git, and
> > > CONFIG_LOCALVERSION_AUTO=y should be mandatory.
> > 
> > I sure hope it never becomes mandatory, I despise that thing.  I don't
> > even do -rc tags.  .nn is .nn until baked and nn.1 appears.
> 
> If you're a git user that changes kernels frequently, then enabling 
> CONFIG_LOCALVERSION_AUTO is _really_ convenient when you learn to use it.
> 
> This is quite common for me:
> 
> 	gitk v$(uname -r)..
> 
> and it works exactly due to CONFIG_LOCALVERSION_AUTO (and because git is 
> rather good at figuring out version numbers). It's a great way to say 
> "ok, what is in my git tree that I'm not actually running right now".
> 
> Another case where CONFIG_LOCALVERSION_AUTO is very useful is when you're 
> noticing some new broken behavior, but it took you a while to notice. 
> You've rebooted several times since, but you know it worked last Tuesday. 
> What do you do?
> 
> The thing to do is
> 
> 	grep "Linux version" /var/log/messages*
> 
> and figure out what the good version was, and then do 
> 
> 	git bisect start
> 	git bisect good ..that-version..
> 	git bisect bad v$(uname -r)
> 
> and off you go. This is _very_ convenient if you are working with some 
> "random git kernel of the day" like I am (and like hopefully others are 
> too, in order to get test coverage).

That's why it irritates me.  I build/test a lot, and do the occasional
bisection, which makes a mess in /boot and /lib/modules.  I use a quilt
stack of git pull diffs as reference/rummage points.  Awkward maybe, but
effective (so no need for autoversion), and no mess.
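
For reference, the grep "Linux version" step quoted above can be sketched in
Python as below. This is only an illustration: the function name
booted_versions is hypothetical, and the sample log lines are made up in the
usual syslog shape rather than taken from any real machine.

```python
import re

# The boot banner starts with "Linux version <release>"; capture the release.
VERSION_RE = re.compile(r"Linux version (\S+)")


def booted_versions(log_lines):
    """Return the distinct kernel releases seen in boot messages, in order."""
    seen = []
    for line in log_lines:
        match = VERSION_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Made-up sample lines; with CONFIG_LOCALVERSION_AUTO the release carries a
# "-NNNNN-g<commit>" suffix that git can resolve back to a commit.
sample = [
    "Mar 24 08:01:02 host kernel: Linux version 2.6.29-rc8-00012-g1234abc (me@host) #1 SMP",
    "Mar 24 08:01:03 host kernel: some other message",
    "Mar 27 09:15:40 host kernel: Linux version 2.6.29-00345-gdeadbee (me@host) #2 SMP",
]

for release in booted_versions(sample):
    print(release)
```

Each release printed this way is a candidate argument for "git bisect good",
prefixed with "v" in the same way as the "v$(uname -r)" usage quoted above.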

	-Mike


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  7:45                                       ` Bojan Smojver
@ 2009-03-28  8:43                                         ` Bojan Smojver
  2009-03-28 21:55                                           ` Bojan Smojver
  0 siblings, 1 reply; 664+ messages in thread
From: Bojan Smojver @ 2009-03-28  8:43 UTC (permalink / raw)
  To: linux-kernel

Bojan Smojver <bojan <at> rexursive.com> writes:

> Isn't it possible to compile a program that simply calls open()/fsync()/close()
> on a given file name?

That was stupid. Ignore me.

--
Bojan





^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  3:55                                                                   ` Gene Heskett
@ 2009-03-28 11:29                                                                     ` Alejandro Riveira Fernández
  0 siblings, 0 replies; 664+ messages in thread
From: Alejandro Riveira Fernández @ 2009-03-28 11:29 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Jeff Garzik, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

El Fri, 27 Mar 2009 23:55:06 -0400
Gene Heskett <gene.heskett@verizon.net> escribió:


> 
> I just let FF update itself to 3.0.8 (from mozilla, not fedora) and there is 
> no 'toolkit' stuff whatsoever in about:config.

 I do not have it either (FF 3.0.8, Ubuntu 8.10). It does not appear when
searching for "sync", nor for "toolkit" or "storage"...

> 
> Is this perchance some extension I don't have installed?
> >
>

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  2:38                                                               ` Linus Torvalds
@ 2009-03-28 11:57                                                                 ` Andreas T.Auer
  0 siblings, 0 replies; 664+ messages in thread
From: Andreas T.Auer @ 2009-03-28 11:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Lord, Jeff Garzik, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On 28.03.2009 03:38 Linus Torvalds wrote:
> On Fri, 27 Mar 2009, Mark Lord wrote:
>  
>> Okay, I'll bite.  Exactly which version of FF has that variable?
>> Cuz it ain't in the FF 3.0.8 that I'm running here.
>>    
>
> I'd suspect that I mistyped it, but I just cut-and-pasted it from my email
> to make sure. Maybe you did. What happens if you just write "sync" in the
> Filter: box? Nothing matches?
>
>  
No, not with my iceweasel 3.0.7 (Debian/testing).

I couldn't find anything in the Debian patch to the source code, but the
source code contains
toolkit/components/contentprefs/src/nsContentPrefService.js 733-746:

    // Turn off disk synchronization checking to reduce disk churn and speed up
    // operations when prefs are changed rapidly (such as when a user repeatedly
    // changes the value of the browser zoom setting for a site).
    //
    // Note: this could cause database corruption if the OS crashes or machine
    // loses power before the data gets written to disk, but this is considered
    // a reasonable risk for the not-so-critical data stored in this database.
    //
    // If you really don't want to take this risk, however, just set the
    // toolkit.storage.synchronous pref to 1 (NORMAL synchronization) or 2
    // (FULL synchronization), in which case mozStorageConnection::Initialize
    // will use that value, and we won't override it here.
    if (!this._prefSvc.prefHasUserValue("toolkit.storage.synchronous"))
      dbConnection.executeSimpleSQL("PRAGMA synchronous = OFF");

Probably they preferred the default value "off" so much that they even
dropped the entry from the standard configuration.
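
As a rough illustration of what that snippet does at the SQLite level, here
is a minimal sketch using Python's sqlite3 module. It is not Firefox code:
the function name open_with_sync and the user_pref argument are hypothetical
stand-ins for the pref lookup, and the 0/1/2 mapping follows the pref values
named in the comment above.

```python
import sqlite3

# Pref values 0, 1 and 2 map onto SQLite's synchronous levels.
LEVELS = {0: "OFF", 1: "NORMAL", 2: "FULL"}


def open_with_sync(path, user_pref=None):
    """Open a database; use synchronous=OFF unless a pref value is given."""
    conn = sqlite3.connect(path)
    # No user-set pref: fall back to OFF, as the quoted snippet does.
    level = 0 if user_pref is None else user_pref
    conn.execute(f"PRAGMA synchronous = {LEVELS[level]}")
    return conn


def current_sync_level(conn):
    """Read back the numeric synchronous level of a connection."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]
```

With synchronous set to OFF, SQLite hands writes to the OS without waiting
for them to reach the disk, which is the trade-off the quoted comment
describes.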
> Do you see firefox pausing a lot under disk load?
I see iceweasel pausing/blocking a lot when loading stalling webpages,
but that's a different topic.




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  2:49                                                                   ` Jeff Garzik
@ 2009-03-28 13:29                                                                     ` Stefan Richter
  2009-03-28 14:17                                                                       ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Stefan Richter @ 2009-03-28 13:29 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mark Lord, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Jeff Garzik wrote:
> Mark Lord wrote:
[store browser session]
>> fsync() isn't going to affect that one way or another
>> unless the entire kernel freezes and dies.
> 
>  ...which is one of the three common crash scenarios listed (and
> experienced in the field).

To get work done which one really cares about, one can always choose a
system which does not crash frequently.  Those who run unstable drivers
for thrills surely do it on boxes on which nothing important is being
done, one would think.
-- 
Stefan Richter
-=====-=-=== -=-= -==-=
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 13:29                                                                     ` Stefan Richter
@ 2009-03-28 14:17                                                                       ` Jeff Garzik
  2009-03-28 14:35                                                                         ` Stefan Richter
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-28 14:17 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Mark Lord, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Stefan Richter wrote:
> Jeff Garzik wrote:
>> Mark Lord wrote:
> [store browser session]
>>> fsync() isn't going to affect that one way or another
>>> unless the entire kernel freezes and dies.
>>  ...which is one of the three common crash scenarios listed (and
>> experienced in the field).
> 
> To get work done which one really cares about, one can always choose a
> system which does not crash frequently.  Those who run unstable drivers
> for thrills surely do it on boxes on which nothing important is being
> done, one would think.

Once software is perfect, there is definitely a lot of useless crash 
protection code to remove.

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 14:17                                                                       ` Jeff Garzik
@ 2009-03-28 14:35                                                                         ` Stefan Richter
  2009-03-28 15:17                                                                           ` Mark Lord
  2009-03-28 16:25                                                                           ` Alex Goebel
  0 siblings, 2 replies; 664+ messages in thread
From: Stefan Richter @ 2009-03-28 14:35 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mark Lord, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Jeff Garzik wrote:
> Stefan Richter wrote:
>> Jeff Garzik wrote:
>>> Mark Lord wrote:
>> [store browser session]
>>>> fsync() isn't going to affect that one way or another
>>>> unless the entire kernel freezes and dies.
>>>  ...which is one of the three common crash scenarios listed (and
>>> experienced in the field).
>>
>> To get work done which one really cares about, one can always choose a
>> system which does not crash frequently.  Those who run unstable drivers
>> for thrills surely do it on boxes on which nothing important is being
>> done, one would think.
> 
> Once software is perfect, there is definitely a lot of useless crash
> protection code to remove.

Well, for the time being, why not base considerations for performance,
interactivity, energy consumption, graceful restoration of application
state etc. on the assumption that kernel crashes are suitably rare?  (At
least on systems where data loss would be of concern.)
-- 
Stefan Richter
-=====-=-=== -=-= -==-=
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 14:35                                                                         ` Stefan Richter
@ 2009-03-28 15:17                                                                           ` Mark Lord
  2009-03-28 16:08                                                                             ` Stefan Richter
                                                                                               ` (2 more replies)
  2009-03-28 16:25                                                                           ` Alex Goebel
  1 sibling, 3 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-28 15:17 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Jeff Garzik, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

The better solution seems to be the rather obvious one:

   the filesystem should commit data to disk before altering metadata.

Much easier and more reliable to centralize it there, rather than
rely (falsely) upon thousands of programs each performing numerous
performance-killing fsync's.

Cheers

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 15:17                                                                           ` Mark Lord
@ 2009-03-28 16:08                                                                             ` Stefan Richter
  2009-03-28 16:32                                                                               ` Linus Torvalds
  2009-03-29  1:18                                                                             ` Jeff Garzik
  2009-03-29 23:14                                                                             ` Dave Chinner
  2 siblings, 1 reply; 664+ messages in thread
From: Stefan Richter @ 2009-03-28 16:08 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jeff Garzik, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

I wrote:
>> Well, for the time being, why not base considerations for performance,
>> interactivity, energy consumption, graceful restoration of application
>> state etc. on the assumption that kernel crashes are suitably rare?  (At
>> least on systems where data loss would be of concern.)

In more general terms:  If overall system reliability is known to be
insufficient, attempt to increase reliability of lower layers first.  If
this approach alone would be too costly in implementation or use, then
look at how to increase reliability of upper layers too.

(Example:  Running a suitably reliable kernel on a desktop for
"mission-critical web browsing" is possible at low cost, at least if
early decisions, e.g. for well-supported video hardware, went right.)


Mark Lord wrote:
> The better solution seems to be the rather obvious one:
> 
>   the filesystem should commit data to disk before altering metadata.
> 
> Much easier and more reliable to centralize it there, rather than
> rely (falsely) upon thousands of programs each performing numerous
> performance-killing fsync's.
> 
> Cheers

Sure.  I forgot:  Not only the frequency of I/O disruption (e.g. due to
kernel crash) factors into system reliability; the particular impact of
such disruption is a factor too.  (How hard is recovery?  Will at least
old data remain available? ...)
-- 
Stefan Richter
-=====-=-=== -=-= -==-=
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 14:35                                                                         ` Stefan Richter
  2009-03-28 15:17                                                                           ` Mark Lord
@ 2009-03-28 16:25                                                                           ` Alex Goebel
  2009-03-28 21:12                                                                             ` Hua Zhong
  1 sibling, 1 reply; 664+ messages in thread
From: Alex Goebel @ 2009-03-28 16:25 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Jeff Garzik, Mark Lord, Linus Torvalds, Matthew Garrett,
	Alan Cox, Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On 3/28/09, Stefan Richter <stefanr@s5r6.in-berlin.de> wrote:

> Well, for the time being, why not base considerations for performance,
> interactivity, energy consumption, graceful restoration of application
> state etc. on the assumption that kernel crashes are suitably rare?  (At
> least on systems where data loss would be of concern.)

Absolutely! That's what I thought all the time when following this
(meanwhile quite grotesque) discussion. Even for ordinary
home/office/laptop/desktop users (!=kernel developers), kernel crashes
are simply not a realistic scenario any more to optimize anything for
(which is due to the good work you guys are doing in making/keeping
the kernel stable).

Alex

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 16:08                                                                             ` Stefan Richter
@ 2009-03-28 16:32                                                                               ` Linus Torvalds
  2009-03-28 17:22                                                                                 ` David Hagood
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-28 16:32 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Mark Lord, Jeff Garzik, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On Sat, 28 Mar 2009, Stefan Richter wrote:
> 
> Sure.  I forgot:  Not only the frequency of I/O disruption (e.g. due to
> kernel crash) factors into system reliability; the particular impact of
> such disruption is a factor too.  (How hard is recovery?  Will at least
> old data remain available? ...)

I suspect (at least from my own anecdotal evidence) that a lot of system 
crashes are basically X hanging. If you use the system as a desktop, at 
that point it's basically dead - and the difference between an X hang and 
a kernel crash is almost totally invisible to users.

Us kernel people may walk over to another machine and ping or ssh in to 
see, but ask yourself how many normal users would do that - especially 
since DOS and Windows have taught people that they need to power-cycle 
(and, in all honesty, especially since there usually is very little else 
you can do even under Linux if X gets confused).

And then part of the problem ends up being that while in theory the kernel 
can continue to write out dirty stuff, in practice people press the power 
button long before it can do so. The 30 second thing is really too long.

And don't tell me about sysrq. I know about sysrq. It's very convenient 
for kernel people, but it's not like most people use it.

But I absolutely hear you - people seem to think that "correctness" trumps 
all, but in reality, quite often users will be happier with a faster 
system - even if they know that they may lose data. They may curse 
themselves (or, more likely, the system) when they _do_ lose data, but 
they'll make the same choice all over two months later.

Which is why I think that if the filesystem people think that the 
"data=ordered" mode is too damn fundamentally hard to make fast in the 
presence of "fsync", and all sane people (definition: me) think that the 
30-second window for either "data=writeback" or the ext4 data writeout is 
too fragile, then we should look into something in between.

Because, in the end, you do have to balance performance vs safety when it 
comes to disk writes. You absolutely have to delay things for performance, 
but it is always going to involve the risk of losing data that you do care 
about, but that you aren't willing (or able - random apps and tons of 
scripting comes to mind) to do a fsync over.

Which is why I, personally, would probably be perfectly happy with a 
"async ordered" mode, for example. At least START the data writeback when 
writing back metadata, but don't necessarily wait for it (and don't 
necessarily make it go first). Turn the "30 second window of death" into 
something much harder to hit.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 16:32                                                                               ` Linus Torvalds
@ 2009-03-28 17:22                                                                                 ` David Hagood
  0 siblings, 0 replies; 664+ messages in thread
From: David Hagood @ 2009-03-28 17:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stefan Richter, Mark Lord, Jeff Garzik, Matthew Garrett,
	Alan Cox, Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

What if you added another phase in the journaling, after the data is
written to the kernel, but before block allocation?

As I understand, the current scenario goes like this:
1) A program writes a bunch of data to a file.
2) The kernel holds the data in buffer cache, delaying allocation.
3) Kernel updates file metadata in journal.
4) Some time later, kernel allocates blocks and writes data.

If things go boom between 3 and 4, you have the files in an inconsistent
state. If the program does an fsync(), then the kernel has to write ALL
data out to be consistent.

What if you could do this:

1) A program writes a bunch of data to a file.
2) The kernel holds the data in buffer cache, delaying allocation.
3) The kernel writes a record to the journal saying "This data goes with
this file, but I've not allocated any blocks for it yet."
4) Kernel updates file metadata in journal.
5) Sometime later, kernel allocates blocks for data, and notes the
allocation in the journal.
6) Sometime later still the kernel commits the data to disk and updates
the journal.

It seems to me this would be a not-unreasonable way to have both the
advantages of delayed allocation AND get the data onto disk quickly.

If the user wants to have speed over safety, you could skip steps 3 and
5 (data=ordered). You want safety, you force everything through steps 3
and 5 (data=journal). You want a middle ground, you only do steps 3
and 5 for files where the program has done an fsync() (data=ordered +
program calls fsync()).

And if you want both speed and safety, you get a big battery-backed up
RAM disk as the journal device and journal everything.
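
The six-step scheme above can be sketched as a toy model. This is only an
illustration under stated assumptions: the record names (INTENT, METADATA,
ALLOC, DATA_DONE) and the recovery rule in replay() are hypothetical, not
anything ext3/ext4 or jbd actually implement. The point it tries to show is
that the extra "intent" record from step 3 lets recovery tell "data was still
pending" apart from "metadata changed with no trace of the data".

```python
# Toy model of the proposed journal phases; all names here are hypothetical.
from dataclasses import dataclass, field

INTENT, METADATA, ALLOC, DATA_DONE = "intent", "metadata", "alloc", "data_done"


@dataclass
class Journal:
    records: list = field(default_factory=list)

    def log(self, kind, path):
        self.records.append((kind, path))


def replay(journal):
    """Classify each file by which phases reached the journal before the crash."""
    seen = {}
    for kind, path in journal.records:
        seen.setdefault(path, set()).add(kind)
    result = {}
    for path, kinds in seen.items():
        if DATA_DONE in kinds:
            result[path] = "consistent"
        elif INTENT in kinds:
            # The journal knows data was outstanding, so recovery can react.
            result[path] = "data pending, detectable"
        else:
            # Metadata was updated with no record of the outstanding data.
            result[path] = "silently inconsistent"
    return result


j = Journal()
j.log(INTENT, "a.txt")
j.log(METADATA, "a.txt")          # proposed scheme, crash before step 5
j.log(METADATA, "b.txt")          # current scheme, crash between steps 3 and 4
for kind in (INTENT, METADATA, ALLOC, DATA_DONE):
    j.log(kind, "c.txt")          # proposed scheme, all steps completed
print(replay(j))
```

What recovery would actually do with a "data pending" file (roll the metadata
back, zero the range, flag it) is left open here, as it is in the proposal.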



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-27 22:10                                         ` Linus Torvalds
@ 2009-03-28 19:43                                           ` Andrew Morton
  0 siblings, 0 replies; 664+ messages in thread
From: Andrew Morton @ 2009-03-28 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Ingo Molnar, Theodore Tso, Alan Cox, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Fri, 27 Mar 2009 15:10:56 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 27 Mar 2009, Jan Kara wrote:
> >
> >   Doable but not fairly simple ;) Firstly you have to restart a transaction
> > when you've used up all the credits you originally started with (easy),
> > secondly ext3 uses lock order PageLock -> "transaction start" which is
> > unusable for the scheme you suggest. So we'd have to revert that - which
> > needs larger audit of our locking scheme and that's probably the reason
> > why no one has done it yet.
> 
> It's also not clear that ext3 can really do much better than the regular 
> generic_writepages() logic. I mean, seriously, what's there to improve on? 

- opening a single transaction for many pages in the cases when a
  transaction _is_ needed.

- single large BIO versus zillions of single-page BIOs.

Relatively minor benefits, but it's a bit odd that we never got around
to doing it.

It just got quite a bit harder to do, so I expect we won't be doing it.
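
As a rough illustration of the "single large BIO versus zillions of
single-page BIOs" point, here is a minimal sketch that groups dirty page
indices into contiguous runs, each of which could in principle be submitted as
one request inside one transaction. The function name and the (start, count)
representation are assumptions made for the example; this is not how
generic_writepages() or ext3 is written.

```python
def coalesce_pages(dirty_pages):
    """Group dirty page indices into (start, count) runs of contiguous pages."""
    runs = []
    for page in sorted(set(dirty_pages)):
        if runs and page == runs[-1][0] + runs[-1][1]:
            # Page directly follows the previous run: extend it.
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            # Gap in the file: start a new run.
            runs.append((page, 1))
    return runs


runs = coalesce_pages([7, 3, 4, 5, 20, 21, 6])
print(runs)  # seven dirty pages become two runs
```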


^ permalink raw reply	[flat|nested] 664+ messages in thread

* RE: Linux 2.6.29
  2009-03-28 16:25                                                                           ` Alex Goebel
@ 2009-03-28 21:12                                                                             ` Hua Zhong
  2009-03-29  8:22                                                                               ` Stefan Richter
  0 siblings, 1 reply; 664+ messages in thread
From: Hua Zhong @ 2009-03-28 21:12 UTC (permalink / raw)
  To: 'Alex Goebel', 'Stefan Richter'
  Cc: 'Jeff Garzik', 'Mark Lord',
	'Linus Torvalds', 'Matthew Garrett',
	'Alan Cox', 'Theodore Tso',
	'Andrew Morton', 'David Rees',
	'Jesper Krogh', 'Linux Kernel Mailing List'

Good point. We should throw away all the journaling junk and just go back 
to ext2. Why pay the extra cost for something we shouldn't optimize for? 
It's not like the kernel ever crashes.

> Absolutely! That's what I thought all the time when following this
> (meanwhile quite grotesque) discussion. Even for ordinary
> home/office/laptop/desktop users (!=kernel developers), kernel crashes
> are simply not a realistic scenario any more to optimize anything for
> (which is due to the good work you guys are doing in making/keeping
> the kernel stable).
> 
> Alex



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  8:43                                         ` Bojan Smojver
@ 2009-03-28 21:55                                           ` Bojan Smojver
  2009-03-31 21:51                                             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 664+ messages in thread
From: Bojan Smojver @ 2009-03-28 21:55 UTC (permalink / raw)
  To: linux-kernel

Bojan Smojver <bojan <at> rexursive.com> writes:

> That was stupid. Ignore me.

And yet, FreeBSD seems to have a command just like that:

http://www.freebsd.org/cgi/man.cgi?query=fsync&sektion=1&manpath=FreeBSD+7.1-RELEASE

--
Bojan



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  3:59                             ` Linus Torvalds
@ 2009-03-28 23:52                               ` david
  0 siblings, 0 replies; 664+ messages in thread
From: david @ 2009-03-28 23:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Theodore Tso, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, 26 Mar 2009, Linus Torvalds wrote:

> On Thu, 26 Mar 2009, Linus Torvalds wrote:
>>
>
> The only excuse _ever_ for user-land tweaking is if you do something
> really odd. Say that you want to get the absolutely best OLTP numbers you
> can possibly get - with no regards for _any_ other workload. In that case,
> you want to tweak the numbers for that exact load, and the exact machine
> that runs it - and the result is going to be a totally worthless number
> (since it's just benchmarketing and doesn't actually reflect any real
> world scenario), but hey, that's what benchmarketing is all about.
>
> Or say that you really are a very embedded environment, with a very
> specific load. A router, a cellphone, a base station, whatever - you do
> one thing, and you're not trying to be a general purpose machine. Then you
> can tweak for that load. But not otherwise.
>
> If you don't have any magical odd special workloads, you shouldn't need to
> tweak a single kernel knob. Because if you need to, then the kernel is
> doing something wrong to begin with.

while I agree with most of what you say, I'll point out that many 
enterprise servers really do care about one particular workload to the 
exclusion of everything else. if you can get another 10% performance by 
tuning your box for an OLTP workload and make your cluster 9 boxes instead 
of 10 it's well worth it (it ends up being better response time for users, 
less hardware, and avoiding software license costs most of the time).

this is somewhere between benchmarking and embedded, but it is a valid 
case.

most users (even most database users) don't need to go after that last 
little bit of performance, the defaults should be good enough for most 
users, no matter what workload they are running.

David Lang

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  1:19                                                               ` Jeff Garzik
  2009-03-28  1:30                                                                 ` David Miller
  2009-03-28  2:19                                                                 ` Mark Lord
@ 2009-03-29  0:33                                                                 ` david
  2009-03-29  1:24                                                                   ` Jeff Garzik
  2009-03-31 15:01                                                                 ` Thierry Vignaud
  3 siblings, 1 reply; 664+ messages in thread
From: david @ 2009-03-29  0:33 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Fri, 27 Mar 2009, Jeff Garzik wrote:

> Linus Torvalds wrote:
>> Of course, your browsing history database is an excellent example of 
>> something you should _not_ care about that much, and where performance is a 
>> lot more important than "ooh, if the machine goes down suddenly, I need to 
>> be 100% up-to-date". Using fsync on that thing was just stupid, even 
>
> If you are doing a ton of web-based work with a bunch of tabs or windows 
> open, you really like the post-crash restoration methods that Firefox now 
> employs.  Some users actually do want to checkpoint/restore their web work, 
> regardless of whether it was the browser, the window system or the OS that 
> crashed.
>
> You may not care about that, but others do care about the integrity of the 
> database that stores the active FF state (Web URLs currently open), a 
> database which necessarily changes for each URL visited.

as one of those users with many windows tabs open (a couple hundred 
normally), even the current firefox behavior isn't good enough because it 
doesn't let me _not_ load everything back in when a link I go to triggers 
a crash in firefox every time it loads.

so what I do is do a git commit in cron every min of the history file. git 
can do the fsync as needed to get it to disk reasonably without firefox 
needing to do it _for_every_click_

like laptop mode, you need to be able to define "I'm willing to lose this 
much activity in the name of performance/power"

ted's suggestion (in his blog) to tweak fsync to 'misbehave' when laptop 
mode is enabled (only pushing data out to disk when the disk is awake 
anyway, or the time has hit) would really work well for most users. 
servers (where you have the data integrity fsync usage) don't use laptop 
mode. desktops could use 'laptop mode' with a delay of 0.5 or 1 second and 
get pretty close to the guarantee that users want without a huge 
performance hit.

David Lang

>
>
> As an aside, I find it highly ironic that Firefox gained useful session 
> management around the same time that some GNOME jarhead no-op'd GNOME session 
> management[1] in X.
>
> 	Jeff
>
>
>
> [1] http://np237.livejournal.com/22014.html
>

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 15:17                                                                           ` Mark Lord
  2009-03-28 16:08                                                                             ` Stefan Richter
@ 2009-03-29  1:18                                                                             ` Jeff Garzik
  2009-03-31 18:45                                                                               ` Jörn Engel
  2009-03-29 23:14                                                                             ` Dave Chinner
  2 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-29  1:18 UTC (permalink / raw)
  To: Mark Lord
  Cc: Stefan Richter, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Mark Lord wrote:
> The better solution seems to be the rather obvious one:
> 
>   the filesystem should commit data to disk before altering metadata.
> 
> Much easier and more reliable to centralize it there, rather than
> rely (falsely) upon thousands of programs each performing numerous
> performance-killing fsync's.

Firstly, the FS data/metadata write-out order says nothing about when 
the write-out is started by the OS.  It only implies consistency in the 
face of a crash during write-out.  Hooray for BSD soft-updates.

If the write-out is started immediately during or after write(2), 
congratulations, you are on your way to reinventing synchronous writes.

If the write-out does not start immediately, then you have a 
many-seconds window for data loss.  And it should be self-evident that 
userland application writers will have some situations where design 
requirements dictate minimizing or eliminating that window.


Secondly, this email sub-thread is not talking about thousands of 
programs, it is talking about Firefox behavior.  Firefox is a multi-OS 
portable application that has a design requirement that user data must 
be protected against crashes.  (same concept as your word processor's 
auto-save feature)

The author of such a portable application must ensure their app saves 
data against Windows Vista kernel crashes, HPUX kernel crashes, OS X 
window system crashes, X11 window system crashes, application crashes, etc.

Can a portable app really rely on what Linux kernel hackers think the 
underlying filesystem _should_ do?

No, it either (a) does not care at all, or (b) uses fsync(2) or 
FlushFileBuffers() because of the guarantees they provide across the OS 
spectrum, in light of the myriad OS filesystem caching, flushing, and 
ordering algorithms.
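
For illustration, a minimal sketch of option (b) as a portable application
might write it: save to a temporary file, flush it explicitly, then rename it
over the old copy. The helper name durable_save() is made up for this example.
Python's os.fsync() calls fsync() on Unix and the C runtime's _commit() on
Windows, so the same code runs on both; the directory sync at the end is a
POSIX-only step and is an assumption about what a careful application would
add, not something stated in this thread.

```python
import os
import tempfile


def durable_save(path, data: bytes):
    """Write data so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                # drain Python's userspace buffer
            os.fsync(f.fileno())     # ask the OS to push the file to stable storage
        os.replace(tmp, path)        # rename over the old copy
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    if os.name == "posix":
        # Persist the rename itself by syncing the containing directory.
        dfd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)
```

The application pays for two flushes per save; that cost is the trade-off
being argued about here.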



Was the BSD soft-updates idea of FS data-before-metadata a good one? 
Yes.  Obviously.

It is the cornerstone of every SANE journalling-esque database or 
filesystem out there -- don't leave a window where your metadata is 
inconsistent.  "Duh" :)

But that says nothing about when a userland app's design requirements 
include ordered writes+flushes of its own application data.  That is the 
common case when a userland app like Firefox uses a transactional 
database such as sqlite or db4.

Thus it is the height of silliness to think that FS data/metadata 
write-out order permits elimination of fsync(2) for the class of 
application that must care about ordered writes/flushes of its own 
application data.

That upstream sqlite replaced fsync(2) with fdatasync(2) makes it 
obvious that FS data/metadata write-out order is irrelevant to Firefox.

The issue with transactional databases is more simply a design tradeoff 
-- level of fsync punishment versus performance etc.  Tweaking the OS 
filesystem doesn't help at all with those design choices.
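
To make the "ordered writes/flushes of its own application data" point
concrete, here is a minimal write-ahead-log sketch. It is not sqlite's or
db4's actual on-disk format; the record layout, function names and file names
are invented for the example. It only shows why the application needs its own
flush points: the log record must be durable before the data file changes, no
matter how the filesystem orders its own data and metadata.

```python
import os
import tempfile

# fdatasync is not available on every platform; fall back to fsync there.
_sync = getattr(os, "fdatasync", os.fsync)


def commit(log_path, data_path, record: bytes):
    """Append record to the log, make it durable, then update the data file."""
    with open(log_path, "ab") as log:
        log.write(len(record).to_bytes(4, "big") + record)
        log.flush()
        _sync(log.fileno())      # flush 1: log record is durable before data changes
    with open(data_path, "ab") as db:
        db.write(record)
        db.flush()
        _sync(db.fileno())       # flush 2: data is durable before the log may be trimmed


def recover(log_path):
    """Return every complete record in the log, ignoring a torn tail."""
    records = []
    with open(log_path, "rb") as log:
        while True:
            header = log.read(4)
            if len(header) < 4:
                break
            size = int.from_bytes(header, "big")
            body = log.read(size)
            if len(body) < size:
                break            # partial append from a crash: never committed
            records.append(body)
    return records


workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "journal.log")
data_path = os.path.join(workdir, "places.db")
commit(log_path, data_path, b"visit:lkml.org")
commit(log_path, data_path, b"visit:kernel.org")
with open(log_path, "ab") as torn:
    torn.write((100).to_bytes(4, "big") + b"vis")   # simulate a crash mid-append
print(recover(log_path))
```

How many of these flushes to issue per user action is the design tradeoff
mentioned above.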

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29  0:33                                                                 ` david
@ 2009-03-29  1:24                                                                   ` Jeff Garzik
  2009-03-29  3:43                                                                     ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-29  1:24 UTC (permalink / raw)
  To: david
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

david@lang.hm wrote:
> ted's suggestion (in his blog) to tweak fsync to 'misbehave' when laptop 
> mode is enabled (only pushing data out to disk when the disk is awake 
> anyway, or the time has hit) would really work well for most users. 
> servers (where you have the data integrity fsync usage) don't use 
> laptop mode. desktops could use 'laptop mode' with a delay of 0.5 or 1 
> second and get pretty close to the guarantee that users want without a 
> huge performance hit.

The existential struggle is overall amusing:

Application writers start using userland transactional databases for 
crash recovery and consistency, and in response, OS writers work to 
undercut the consistency guarantees currently provided by the OS.


More seriously, if we get sqlite, db4 and a few others behaving sanely 
WRT fsync, you cover a wide swath of apps all at once.

I absolutely agree that db4, sqlite and friends need to be smarter in 
the case of laptop mode or overall power saving.

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29  1:24                                                                   ` Jeff Garzik
@ 2009-03-29  3:43                                                                     ` Theodore Tso
  2009-03-29  4:53                                                                       ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-29  3:43 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: david, Linus Torvalds, Matthew Garrett, Alan Cox, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Sat, Mar 28, 2009 at 09:24:59PM -0400, Jeff Garzik wrote:
>> ted's suggestion (in his blog) to tweak fsync to 'misbehave' when 
>> laptop mode is enabled (only pushing data out to disk when the disk is 
>> awake anyway, or the time has hit) would really work well for most 
>> users. servers (where you have the data integrity fsync usage) don't 
>> use laptop mode. desktops could use 'laptop mode' with a delay of 0.5 
>> or 1 second and get pretty close to the guarantee that users want 
>> without a huge performance hit.
>
> The existential struggle is overall amusing:
>
> Application writers start using userland transactional databases for  
> crash recovery and consistency, and in response, OS writers work to  
> undercut the consistency guarantees currently provided by the OS.

Actually, it makes a lot of sense, if you think about it in this way. 

The requirement is this; by default, data which is critical shouldn't
be lost.  (Whether this should be done by the filesystem performing
magic, or the application/database programmer being careful about
using fsync --- and whether we should treat all files as critical and
to hell with performance, or only those which the application has
designated as precious or nonprecious --- there is some dispute.)

However, the system administrator should be able to say, "I want
laptop mode functionality", and with the turn of a single dial, be
able to say, "In order to save batteries, I'm OK with losing up to X
seconds/minutes worth of work."  I would envision a control panel GUI
where there is one checkbox, "enable laptop mode", and another
checkbox, "enable laptop mode only when on battery" (which is greyed
out unless the first checkbox is enabled), and then a slidebar
which allows the user to set how many seconds and/or minutes the user
is willing to lose if the system crashes.

At that point, it's up to the user.  Maybe the defaults should be
something like 15 seconds; maybe the defaults should be 5 seconds.
Maybe the defaults should be automatically set to different values by
different distributions, depending on whether said distro is willing
to use badly unstable proprietary binary video drivers that crash if
you look at them funny.

The advantage of such a scheme is that there's a single knob for the
user to control, instead of one for each application.  And fundamentally,
it should be OK for a user of the desktop and/or the system
administrator to make this tradeoff.  That's where the choice belongs;
not to the application writer, and not to the filesystem maintainer,
or OS programmers in general.

If I have a Lenovo X61s which is rock solid stable, with Intel video
drivers, I might be willing to risk losing up to 10 minutes of work,
secure in the knowledge it's highly unlikely to happen.  If I'm an
Ubuntu user with some super-unstable proprietary video driver, maybe I'd
be more comfortable with this being 5 or 10 seconds.  But if we leave
it up to the user, and they have an easy-to-use control panel that
controls it, the user can decide for themselves where they want to trade
off performance, battery life, and potential window for data loss.

So having some mode where we can suspend all writes to the disk for up
to a user-defined limit --- and then once the disk wakes up, for
reading or for writing, we flush out all dirty data --- makes a lot of
sense.  Laptop mode does most of this already, except that it doesn't
intercept fsync() requests.  And as long as the user has given
permission to the operating system to defer fsync() requests by up to
some user-specified time limit, IMHO that's completely fair game.
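
A toy model of that idea, purely as a sketch: fsync() requests are queued
rather than executed, and everything is flushed once the disk is awake anyway
or once the oldest request has waited for the user-chosen limit. The class and
method names are hypothetical, time is passed in explicitly to keep the
example deterministic, and nothing here reflects how laptop mode is actually
implemented in the kernel.

```python
class DeferredSync:
    """Hold fsync requests until the disk is awake or a deadline passes."""

    def __init__(self, max_delay):
        self.max_delay = max_delay   # seconds of work the user agreed to risk
        self.pending = []            # (request_time, name), in arrival order
        self.flushed = []            # names written out, in flush order

    def fsync(self, name, now):
        """Record the request instead of spinning the disk up immediately."""
        self.pending.append((now, name))

    def tick(self, now, disk_awake):
        """Flush all pending requests if the disk is awake or the oldest expired."""
        if not self.pending:
            return False
        expired = now - self.pending[0][0] >= self.max_delay
        if disk_awake or expired:
            self.flushed.extend(name for _, name in self.pending)
            self.pending.clear()
            return True
        return False


d = DeferredSync(max_delay=15)
d.fsync("places.sqlite", now=0)
d.fsync("session.js", now=5)
print(d.tick(now=10, disk_awake=False))   # still inside the window
print(d.tick(now=15, disk_awake=False))   # oldest request hit the limit
```

The single max_delay value plays the role of the slidebar described above.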

							- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29  3:43                                                                     ` Theodore Tso
@ 2009-03-29  4:53                                                                       ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-29  4:53 UTC (permalink / raw)
  To: Theodore Tso, Matthew Garrett, Alan Cox
  Cc: david, Linus Torvalds, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Theodore Tso wrote:
> So having some mode where we can suspend all writes to the disk for up
> to a user-defined limit --- and then once the disk wakes up, for
> reading or for writing, we flush out all dirty data --- makes a lot of
> sense.  Laptop mode does most of this already, except that it doesn't
> intercept fsync() requests.  And as long as the user has given
> permission to the operating system to defer fsync() requests by up to
> some user-specified time limit, IMHO that's completely fair game.


Overall I agree, but I would rewrite that as:  it's fair game as long as 
the OS doesn't undercut the deliberate write ordering performed by the 
userland application.

When the "laptop mode fsync plug" is uncorked, writes should not be 
merged across an fsync(2) barrier; otherwise it becomes impossible to 
build transactional databases with any consistency guarantees at all.
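
A minimal sketch of that constraint, with a made-up operation format: deferred
writes are split into epochs at every fsync, and a later write may supersede
an earlier one to the same block only inside one epoch. plan_flush() and the
tuple layout are assumptions for illustration, not a kernel interface.

```python
def plan_flush(ops):
    """Split deferred ops into epochs at each fsync; merge only inside an epoch.

    ops is a list of ("write", block, data) or ("fsync",) tuples.
    Returns a list of epochs, each a dict {block: last data written in it}.
    Epochs must reach the disk in list order.
    """
    epochs, current = [], {}
    for op in ops:
        if op[0] == "write":
            _, block, data = op
            current[block] = data    # supersedes earlier write in the same epoch
        elif op[0] == "fsync":
            if current:
                epochs.append(current)
            current = {}             # nothing after this may merge with what came before
        else:
            raise ValueError(f"unknown op {op!r}")
    if current:
        epochs.append(current)
    return epochs


ops = [
    ("write", 1, "log-A"),
    ("write", 1, "log-A2"),
    ("fsync",),
    ("write", 2, "data-A"),
    ("write", 1, "log-B"),
    ("fsync",),
]
print(plan_flush(ops))
```

Block 1 shows up in both epochs: "log-A2" has to be on disk before "data-A" is
written, even though "log-B" later overwrites the same block.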

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 21:12                                                                             ` Hua Zhong
@ 2009-03-29  8:22                                                                               ` Stefan Richter
  0 siblings, 0 replies; 664+ messages in thread
From: Stefan Richter @ 2009-03-29  8:22 UTC (permalink / raw)
  To: Hua Zhong
  Cc: 'Alex Goebel', 'Jeff Garzik', 'Mark Lord',
	'Linus Torvalds', 'Matthew Garrett',
	'Alan Cox', 'Theodore Tso',
	'Andrew Morton', 'David Rees',
	'Jesper Krogh', 'Linux Kernel Mailing List'

Hua Zhong wrote:
> Good point. We should throw away all the journaling junk and just go back 
> to ext2. Why pay the extra cost for something we shouldn't optimize for? 
> It's not like the kernel ever crashes.

The previous two posts were about assumptions at the level of
application software, not at the kernel level.
-- 
Stefan Richter
-=====-=-=== -=-= -==-=
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage dev flush from generic file_fsync helper
  2009-03-27 20:50                                     ` [PATCH] issue storage dev flush from generic file_fsync helper Jeff Garzik
@ 2009-03-29  8:25                                       ` Christoph Hellwig
  2009-03-30  1:25                                         ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 664+ messages in thread
From: Christoph Hellwig @ 2009-03-29  8:25 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Looks good except that we still need a tuning knob for it.  Preferably
one that works for these filesystems and all the existing barrier using
ones.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 20:38                                       ` Jeff Garzik
  2009-03-28  0:14                                         ` Alan Cox
@ 2009-03-29  8:25                                         ` Christoph Hellwig
  1 sibling, 0 replies; 664+ messages in thread
From: Christoph Hellwig @ 2009-03-29  8:25 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Theodore Tso, Jens Axboe, Linus Torvalds,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 04:38:35PM -0400, Jeff Garzik wrote:
> At the very least, IMO the block layer should be able to notice when  
> barriers need not be translated into cache flushes.  Most notably when  
> wb cache is disabled on the drive, something easy to auto-detect, but  
> probably a manual switch also, for people with enterprise battery-backed  
> storage and such.

Yeah, that's why I suggested putting the tuning knob in the block layer
and not in the filesystems when this came up last time.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:00                                                         ` Alan Cox
@ 2009-03-29  9:15                                                           ` Xavier Bestel
  2009-03-29 20:16                                                             ` Alan Cox
  0 siblings, 1 reply; 664+ messages in thread
From: Xavier Bestel @ 2009-03-29  9:15 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Matthew Garrett, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Friday, 27 March 2009 at 19:00 +0000, Alan Cox wrote:
> Actually "pure sh*t" is most of the software currently written. The more
> code I read the happier I get that the lawmakers are finally sick of it
> and going to make damned sure software is subject to liability law. Boy
> will that improve things.

Alan, you're getting old.



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29  9:15                                                           ` Xavier Bestel
@ 2009-03-29 20:16                                                             ` Alan Cox
  2009-03-29 21:07                                                               ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-29 20:16 UTC (permalink / raw)
  To: Xavier Bestel
  Cc: Linus Torvalds, Matthew Garrett, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Sun, 29 Mar 2009 11:15:46 +0200
Xavier Bestel <xavier.bestel@free.fr> wrote:

> On Friday, 27 March 2009 at 19:00 +0000, Alan Cox wrote:
> > Actually "pure sh*t" is most of the software currently written. The more
> > code I read the happier I get that the lawmakers are finally sick of it
> > and going to make damned sure software is subject to liability law. Boy
> > will that improve things.
> 
> Alan, you're getting old.

Yep, and twenty years on software hasn't improved.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29 20:16                                                             ` Alan Cox
@ 2009-03-29 21:07                                                               ` Linus Torvalds
  2009-03-30 19:37                                                                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-29 21:07 UTC (permalink / raw)
  To: Alan Cox
  Cc: Xavier Bestel, Matthew Garrett, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Sun, 29 Mar 2009, Alan Cox wrote:
> 
> Yep and twenty years on software hasn't improved

I really think you're gilding the edges of those old memories. The 
software 20 years ago wasn't that great. I'd say it was on the whole a 
whole lot crappier than it is today.

It's just that we have much higher expectations, and our problem sizes 
have grown a _lot_ faster than rotating disk latencies have improved. 
People didn't worry about having a hundred megs of dirty data and doing an 
'fsync' twenty years ago. Even on big hardware (if you _had_ a hundred 
megs of dirty data you didn't worry about latencies of a few seconds), 
never mind in the Linux world.

This particular problem really largely boils down to "average memory 
capacity has expanded a _lot_ more than harddisk speeds have gone up".

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 15:17                                                                           ` Mark Lord
  2009-03-28 16:08                                                                             ` Stefan Richter
  2009-03-29  1:18                                                                             ` Jeff Garzik
@ 2009-03-29 23:14                                                                             ` Dave Chinner
  2009-03-30  0:39                                                                               ` Theodore Tso
                                                                                                 ` (2 more replies)
  2 siblings, 3 replies; 664+ messages in thread
From: Dave Chinner @ 2009-03-29 23:14 UTC (permalink / raw)
  To: Mark Lord
  Cc: Stefan Richter, Jeff Garzik, Linus Torvalds, Matthew Garrett,
	Alan Cox, Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Sat, Mar 28, 2009 at 11:17:08AM -0400, Mark Lord wrote:
> The better solution seems to be the rather obvious one:
>
>   the filesystem should commit data to disk before altering metadata.

Generalities are bad. For example:

write();
unlink();
<do more stuff>
close();

This is a clear case where you want metadata changed before data is
committed to disk. In many cases, you don't even want the data to
hit the disk here.
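
To make the write/unlink/close case concrete, here is a minimal userspace
sketch of a scratch file that is unlinked while still open. The function name
and the idea of passing the scratch path in are assumptions made for
illustration only; nothing here comes from a real program discussed in this
thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Write a scratch file, unlink it while it is still open, keep using it
 * through the descriptor, then close it.
 * `path` is a hypothetical scratch location chosen by the caller.
 * Returns 0 on success, -1 on error.
 */
int scratch_roundtrip(const char *path)
{
	const char msg[] = "scratch data";
	char back[sizeof(msg)];
	struct stat st;
	int fd;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		goto fail;

	/* The name goes away here; the data stays reachable through fd. */
	if (unlink(path) != 0)
		goto fail;

	/* <do more stuff>: read the data back through the open descriptor. */
	if (pread(fd, back, sizeof(back), 0) != (ssize_t)sizeof(back))
		goto fail;
	if (memcmp(msg, back, sizeof(msg)) != 0)
		goto fail;

	/* The name must no longer resolve. */
	if (stat(path, &st) == 0)
		goto fail;

	/* Closing drops the last reference; the data never needed to be durable. */
	return close(fd);

fail:
	close(fd);
	return -1;
}
```

Nothing in this sequence asks for durability, which is why forcing the data
out before the unlink would only add I/O for a file that is about to vanish.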

Similarly, rsync does the magic open,write,close,rename sequence
without an fsync before the rename. And it doesn't need the fsync,
either. The proposed implicit fsync on rename will kill rsync
performance, and I think that may make many people unhappy....
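
For reference, the open/write/close/rename sequence can be sketched in
userspace as below, with the fsync step made optional. This is an
illustrative sketch only: replace_file() and its "durable" flag are
hypothetical names, and it is not how rsync is actually implemented.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with new contents via a temporary file and rename().
 * `tmp_path` must live on the same filesystem as `path`.
 * If `durable` is non-zero, fsync() the new data before the rename; if it
 * is zero, this is the sequence without fsync described above.
 * Returns 0 on success, -1 on error.
 */
int replace_file(const char *path, const char *tmp_path,
		 const void *data, size_t len, int durable)
{
	const char *p = data;
	size_t done = 0;
	int fd;

	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* write() may be short, so loop until everything is handed over. */
	while (done < len) {
		ssize_t n = write(fd, p + done, len - done);
		if (n <= 0)
			goto fail;
		done += (size_t)n;
	}

	/* Optional: ask for the data to reach the device before renaming. */
	if (durable && fsync(fd) != 0)
		goto fail;

	if (close(fd) != 0) {
		unlink(tmp_path);
		return -1;
	}

	/* rename() switches the name over to the new file in one step. */
	if (rename(tmp_path, path) != 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp_path);
	return -1;
}
```

A caller that can simply rerun the whole transfer after a crash would pass
durable = 0; one that must never end up with an empty target would pass 1.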

> Much easier and more reliable to centralize it there, rather than
> rely (falsely) upon thousands of programs each performing numerous
> performance-killing fsync's.

The filesystem should batch the fsyncs efficiently. If the
filesystem doesn't handle fsync efficiently, then it is a bad
filesystem choice for that workload....


Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29 23:14                                                                             ` Dave Chinner
@ 2009-03-30  0:39                                                                               ` Theodore Tso
  2009-03-30  1:29                                                                                 ` Trenton Adams
                                                                                                   ` (2 more replies)
  2009-03-30  3:01                                                                               ` Mark Lord
  2009-03-30 12:55                                                                               ` Chris Mason
  2 siblings, 3 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-30  0:39 UTC (permalink / raw)
  To: Mark Lord, Stefan Richter, Jeff Garzik, Linus Torvalds,
	Matthew Garrett, Alan Cox, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Mon, Mar 30, 2009 at 10:14:51AM +1100, Dave Chinner wrote:
> This is a clear case where you want metadata changed before data is
> committed to disk. In many cases, you don't even want the data to
> hit the disk here.
> 
> Similarly, rsync does the magic open,write,close,rename sequence
> without an fsync before the rename. And it doesn't need the fsync,
> either. The proposed implicit fsync on rename will kill rsync
> performance, and I think that may make many people unhappy....

I agree.  But unfortunately, I think we're going to be bullied into
data=ordered semantics for the open/write/close/rename sequence, at
least as the default.  Ext4 has a noauto_da_alloc mount option (which
Eric Sandeen suggested we rename to "no_pony" :-), for people who
mostly run sane applications that use fsync().

For people who care about rsync's performance and who assume that they
can always restart rsync if the system crashes while it is running,
rsync could add Yet Another Rsync Option :-) which explicitly unlinks
the target file before the rename, which would disable the implicit
fsync().

> > Much easier and more reliable to centralize it there, rather than
> > rely (falsely) upon thousands of programs each performing numerous
> > performance-killing fsync's.
> 
> The filesystem should batch the fsyncs efficiently. If the
> filesystem doesn't handle fsync efficiently, then it is a bad
> filesystem choice for that workload....

All I can do is apologize to all other filesystem developers profusely
for ext3's data=ordered semantics; at this point, I very much regret
that we made data=ordered the default for ext3.  But the application
writers vastly outnumber us, and realistically we're not going to be
able to easily roll back eight years of application writers being
trained that fsync() is not necessary, and actually is detrimental for
ext3.

	       		      	       	      - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH] issue storage dev flush from generic file_fsync helper
  2009-03-29  8:25                                       ` Christoph Hellwig
@ 2009-03-30  1:25                                         ` Fernando Luis Vázquez Cao
  2009-03-30  1:36                                           ` [PATCH 1/5] block: Add block_flush_device() Fernando Luis Vázquez Cao
                                                             ` (4 more replies)
  0 siblings, 5 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30  1:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Christoph Hellwig wrote:
> Looks good except that we still need a tuning knob for it.  Preferably
> one that works for these filesystems and all the existing barrier-using
> ones.

Christoph, I reworked my previous fsync() patches so that what was a 
mount option to trigger a storage device writeback cache flush becomes a 
sysfs knob.
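
As a rough illustration of how such a knob would be driven from userspace,
here is a small sketch. It assumes a kernel carrying the proposed
[PATCH 5/5] from this thread, where the attribute is named "wbcflush" and
would presumably appear as /sys/block/<dev>/queue/wbcflush; that path and the
helper name set_wbcflush() are assumptions, not an existing interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write "1" (cache flushes enabled) or "0" (cache flushes disabled) to a
 * knob file. `knob_path` is supplied by the caller; on a patched kernel it
 * would presumably be /sys/block/<dev>/queue/wbcflush (an assumption).
 * Returns 0 on success, -1 on error.
 */
int set_wbcflush(const char *knob_path, int enable)
{
	const char *val = enable ? "1\n" : "0\n";
	int fd, ok;

	/* No O_CREAT: the attribute either exists or the kernel lacks it. */
	fd = open(knob_path, O_WRONLY);
	if (fd < 0)
		return -1;

	ok = (write(fd, val, 2) == 2);
	if (close(fd) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Writing 0 matches the store routine in the posted patch, which sets
QUEUE_FLAG_NOWBCFLUSH for a zero value and clears it otherwise.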

I still need to add support for automatic detection of underlying 
device's flushing capabilities but first I would like to know if you 
agree with the general approach.

I'll be replying to this email with the new patches.

- Fernando

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30  0:39                                                                               ` Theodore Tso
@ 2009-03-30  1:29                                                                                 ` Trenton Adams
  2009-03-30  3:28                                                                                   ` Theodore Tso
  2009-03-30  6:31                                                                                 ` Dave Chinner
  2009-03-30  7:13                                                                                 ` Andreas T.Auer
  2 siblings, 1 reply; 664+ messages in thread
From: Trenton Adams @ 2009-03-30  1:29 UTC (permalink / raw)
  To: Theodore Tso, Mark Lord, Stefan Richter, Jeff Garzik,
	Linus Torvalds, Matthew Garrett, Alan Cox, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Sun, Mar 29, 2009 at 6:39 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Mon, Mar 30, 2009 at 10:14:51AM +1100, Dave Chinner wrote:
> All I can do is apologize to all other filesystem developers profusely
> for ext3's data=ordered semantics; at this point, I very much regret
> that we made data=ordered the default for ext3.  But the application
> writers vastly outnumber us, and realistically we're not going to be
> able to easily roll back eight years of application writers being
> trained that fsync() is not necessary, and actually is detrimental for
> ext3.

I am slightly confused by the "data=ordered" thing that everyone has been
mentioning of late.  In theory, it made sense to me before I tried it.
I switched to mounting my ext3 as ext4, and I'm still seeing
seriously delayed fsyncs.  Theodore, I used a modified version of your
fsync-tester.c to bench 1M writes while doing a dd, and I'm still
getting *almost* as bad "fsync" performance as I was on ext3.  On
ext3, the fsync would usually not finish until the dd was complete.

I am currently using Linus' tree at v2.6.29, in x86_64 mode.  If you
need more info, let me know.

tdamac ~ # mount
/dev/mapper/s-sys on / type ext4 (rw)

dd if=/dev/zero of=/tmp/bigfile bs=1M count=2000

Your modified fsync test renamed to fs-bench...
tdamac kernel-sluggish # ./fs-bench --sync
write (sync: 1) time: 0.0301
write (sync: 1) time: 0.2098
write (sync: 1) time: 0.0291
write (sync: 1) time: 0.0264
write (sync: 1) time: 1.1664
write (sync: 1) time: 4.0421
write (sync: 1) time: 4.3212
write (sync: 1) time: 3.5316
write (sync: 1) time: 18.6760
write (sync: 1) time: 3.7851
write (sync: 1) time: 13.6281
write (sync: 1) time: 19.4889
write (sync: 1) time: 15.4923
write (sync: 1) time: 7.3491
write (sync: 1) time: 0.0269
write (sync: 1) time: 0.0275
...

This topic is important to me, as it has been affecting my home
machine quite a bit.  I can test things as I have time.
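
For anyone who wants to reproduce this kind of measurement, a minimal sketch
of a timed write-plus-fsync follows. It is not the fsync-tester.c or fs-bench
source used above; timed_write() and its parameters are hypothetical names
chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Time one write of `len` bytes to `path`, optionally followed by fsync().
 * Returns the elapsed time in seconds, or -1.0 on error.
 */
double timed_write(const char *path, size_t len, int do_sync)
{
	struct timespec t0, t1;
	size_t done = 0;
	double secs = -1.0;
	char *buf;
	int fd;

	buf = malloc(len);
	if (!buf)
		return -1.0;
	memset(buf, 'a', len);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0) {
		free(buf);
		return -1.0;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);
		if (n <= 0)
			goto out;
		done += (size_t)n;
	}
	/* With do_sync set, the fsync() latency is part of the measurement. */
	if (do_sync && fsync(fd) != 0)
		goto out;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
out:
	close(fd);
	free(buf);
	return secs;
}
```

Calling this in a loop with len = 1 MiB while a large dd runs in the
background should show the same kind of spread between fast and stalled
iterations, though the absolute numbers depend on the disk and filesystem.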

Lastly, is there any way data=ordered could be re-written to be
"smart" about not making other processes wait on fsync?  Or is that
sort of thing only handled in the scheduler? (not a kernel hacker
here)

Sorry if I'm interrupting.  Perhaps I should even be starting another thread?

^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 1/5] block: Add block_flush_device()
  2009-03-30  1:25                                         ` Fernando Luis Vázquez Cao
@ 2009-03-30  1:36                                           ` Fernando Luis Vázquez Cao
  2009-03-30  1:40                                           ` [PATCH 2/5] ext3: call blkdev_issue_flush on fsync() Fernando Luis Vázquez Cao
                                                             ` (3 subsequent siblings)
  4 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30  1:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

This patch adds a helper function that should be used by filesystems that need
to flush the underlying block device on fsync()/fdatasync().

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/buffer.c linux-2.6.29/fs/buffer.c
--- linux-2.6.29-orig/fs/buffer.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/buffer.c	2009-03-28 20:43:51.000000000 +0900
@@ -165,6 +165,17 @@ void end_buffer_write_sync(struct buffer
  	put_bh(bh);
  }

+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+	int ret = 0;
+
+	ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+	return (ret == -EOPNOTSUPP) ? 0 : ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
  /*
   * Write out and wait upon all the dirty data associated with a block
   * device via its mapping.  Does not take the superblock lock.
diff -urNp linux-2.6.29-orig/include/linux/buffer_head.h linux-2.6.29/include/linux/buffer_head.h
--- linux-2.6.29-orig/include/linux/buffer_head.h	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/include/linux/buffer_head.h	2009-03-28 20:43:51.000000000 +0900
@@ -238,6 +238,7 @@ int nobh_write_end(struct file *, struct
  int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
  int nobh_writepage(struct page *page, get_block_t *get_block,
                          struct writeback_control *wbc);
+int block_flush_device(struct super_block *sb);

  void buffer_init(void);


^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 2/5] ext3: call blkdev_issue_flush on fsync()
  2009-03-30  1:25                                         ` Fernando Luis Vázquez Cao
  2009-03-30  1:36                                           ` [PATCH 1/5] block: Add block_flush_device() Fernando Luis Vázquez Cao
@ 2009-03-30  1:40                                           ` Fernando Luis Vázquez Cao
  2009-03-30  1:51                                             ` Jeff Garzik
  2009-03-30  1:43                                           ` [PATCH 3/5] ext4: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
                                                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30  1:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/ext3/fsync.c linux-2.6.29/fs/ext3/fsync.c
--- linux-2.6.29-orig/fs/ext3/fsync.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/ext3/fsync.c	2009-03-28 20:45:40.000000000 +0900
@@ -45,6 +45,8 @@
  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
  {
  	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
  	int ret = 0;

  	J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +71,30 @@ int ext3_sync_file(struct file * file, s
  	 */
  	if (ext3_should_journal_data(inode)) {
  		ret = ext3_force_commit(inode->i_sb);
-		goto out;
+		if (!(journal->j_flags & JFS_BARRIER))
+			block_flush_device(inode->i_sb);
+		return ret;
  	}

-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+		if (i_state & I_DIRTY_PAGES)
+			block_flush_device(inode->i_sb);
+		return ret;
+	}

  	/*
  	 * The VFS has written the file data.  If the inode is unaltered
  	 * then we need not start a commit.
  	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
  		struct writeback_control wbc = {
  			.sync_mode = WB_SYNC_ALL,
  			.nr_to_write = 0, /* sys_fsync did this */
  		};
  		ret = sync_inode(inode, &wbc);
+		if (journal && !(journal->j_flags & JFS_BARRIER))
+			block_flush_device(inode->i_sb);
  	}
-out:
+
  	return ret;
  }

^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 3/5] ext4: call blkdev_issue_flush on fsync
  2009-03-30  1:25                                         ` Fernando Luis Vázquez Cao
  2009-03-30  1:36                                           ` [PATCH 1/5] block: Add block_flush_device() Fernando Luis Vázquez Cao
  2009-03-30  1:40                                           ` [PATCH 2/5] ext3: call blkdev_issue_flush on fsync() Fernando Luis Vázquez Cao
@ 2009-03-30  1:43                                           ` Fernando Luis Vázquez Cao
  2009-03-30  1:53                                           ` [PATCH 4/5] vfs: call blkdev_issue_flush from generic file_fsync helper Fernando Luis Vázquez Cao
  2009-03-30  1:59                                           ` [PATCH 5/5] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes Fernando Luis Vázquez Cao
  4 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30  1:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/ext4/fsync.c linux-2.6.29/fs/ext4/fsync.c
--- linux-2.6.29-orig/fs/ext4/fsync.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/ext4/fsync.c	2009-03-28 20:48:17.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
  {
  	struct inode *inode = dentry->d_inode;
  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
  	int ret = 0;

  	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,30 @@ int ext4_sync_file(struct file *file, st
  	 */
  	if (ext4_should_journal_data(inode)) {
  		ret = ext4_force_commit(inode->i_sb);
-		goto out;
+		if (!(journal->j_flags & JBD2_BARRIER))
+			block_flush_device(inode->i_sb);
+		return ret;
  	}

-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+		if (i_state & I_DIRTY_PAGES)
+			block_flush_device(inode->i_sb);
+		return ret;
+	}

  	/*
  	 * The VFS has written the file data.  If the inode is unaltered
  	 * then we need not start a commit.
  	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
  		struct writeback_control wbc = {
  			.sync_mode = WB_SYNC_ALL,
  			.nr_to_write = 0, /* sys_fsync did this */
  		};
  		ret = sync_inode(inode, &wbc);
-		if (journal && (journal->j_flags & JBD2_BARRIER))
-			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+		if (journal && !(journal->j_flags & JBD2_BARRIER))
+			block_flush_device(inode->i_sb);
  	}
-out:
+
  	return ret;
  }

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/5] ext3: call blkdev_issue_flush on fsync()
  2009-03-30  1:40                                           ` [PATCH 2/5] ext3: call blkdev_issue_flush on fsync() Fernando Luis Vázquez Cao
@ 2009-03-30  1:51                                             ` Jeff Garzik
  2009-03-30  2:50                                               ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30  1:51 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List

Fernando Luis Vázquez Cao wrote:
> To ensure that bits are truly on-disk after an fsync or fdatasync, we
> should force a disk flush explicitly when there is dirty data/metadata
> and the journal didn't emit a write barrier (either because metadata is
> not being synched or barriers are disabled).
> 
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>

Your patches do not seem to propagate the issue-flush error code, even 
when it is easily available.
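
To illustrate what propagating the code could look like, here is a small
userspace model. It uses plain integers in place of the kernel calls, and
the names first_error() and sync_file_model() are hypothetical; it only
mirrors the "if (!ret) ret = err;" idiom that the file_fsync patch in this
series already uses.

```c
#include <assert.h>
#include <errno.h>

/* Keep the first error seen; a later success must not overwrite it. */
static int first_error(int ret, int err)
{
	return ret ? ret : err;
}

/*
 * Userspace model of an fsync path. `commit_ret` stands in for the result
 * of the journal commit or inode writeback, `flush_ret` for the result of
 * the device cache flush. Both are plain ints here, not kernel calls.
 */
int sync_file_model(int commit_ret, int flush_ret)
{
	int ret = commit_ret;

	/* As in the posted helper, "flush not supported" is not a failure. */
	if (flush_ret == -EOPNOTSUPP)
		flush_ret = 0;

	return first_error(ret, flush_ret);
}
```

With this shape, a failed flush after a successful commit reaches the
caller of fsync(), while an earlier commit error still takes precedence.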

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 4/5] vfs: call blkdev_issue_flush from generic file_fsync helper
  2009-03-30  1:25                                         ` Fernando Luis Vázquez Cao
                                                             ` (2 preceding siblings ...)
  2009-03-30  1:43                                           ` [PATCH 3/5] ext4: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
@ 2009-03-30  1:53                                           ` Fernando Luis Vázquez Cao
  2009-03-30  1:59                                           ` [PATCH 5/5] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes Fernando Luis Vázquez Cao
  4 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30  1:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

To ensure that bits are truly on-disk after an fsync or fdatasync
we should force a disk flush explicitly. This is necessary to
have data integrity guarantees in filesystems such as FAT which
do not provide their own fsync implementation and use the vfs
helper instead.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/sync.c linux-2.6.29/fs/sync.c
--- linux-2.6.29-orig/fs/sync.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/sync.c	2009-03-28 20:58:54.000000000 +0900
@@ -72,6 +72,11 @@ int file_fsync(struct file *filp, struct
  	err = sync_blockdev(sb->s_bdev);
  	if (!ret)
  		ret = err;
+
+	err = block_flush_device(sb);
+	if (!ret)
+		ret = err;
+
  	return ret;
  }


^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 5/5] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30  1:25                                         ` Fernando Luis Vázquez Cao
                                                             ` (3 preceding siblings ...)
  2009-03-30  1:53                                           ` [PATCH 4/5] vfs: call blkdev_issue_flush from generic file_fsync helper Fernando Luis Vázquez Cao
@ 2009-03-30  1:59                                           ` Fernando Luis Vázquez Cao
  4 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30  1:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Add a sysfs knob to disable storage device writeback cache flushes.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
--- linux-2.6.29-orig/block/blk-barrier.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-barrier.c	2009-03-29 17:55:45.000000000 +0900
@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
  	if (!q)
  		return -ENXIO;

+	if (blk_queue_nowbcflush(q))
+		return -EOPNOTSUPP;
+
  	bio = bio_alloc(GFP_KERNEL, 0);
  	if (!bio)
  		return -ENOMEM;
diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
--- linux-2.6.29-orig/block/blk-core.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-core.c	2009-03-29 18:09:18.000000000 +0900
@@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
  			goto end_io;
  		}
  		if (bio_barrier(bio) && bio_has_data(bio) &&
-		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
+		    (blk_queue_nowbcflush(q) ||
+		     q->next_ordered == QUEUE_ORDERED_NONE)) {
  			err = -EOPNOTSUPP;
  			goto end_io;
  		}
diff -urNp linux-2.6.29-orig/block/blk-sysfs.c linux-2.6.29/block/blk-sysfs.c
--- linux-2.6.29-orig/block/blk-sysfs.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-sysfs.c	2009-03-30 10:21:38.000000000 +0900
@@ -151,6 +151,27 @@ static ssize_t queue_nonrot_store(struct
  	return ret;
  }

+static ssize_t queue_wbcflush_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(!blk_queue_nowbcflush(q), page);
+}
+
+static ssize_t queue_wbcflush_store(struct request_queue *q, const char *page,
+				    size_t count)
+{
+	unsigned long nm;
+	ssize_t ret = queue_var_store(&nm, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	if (nm)
+		queue_flag_clear(QUEUE_FLAG_NOWBCFLUSH , q);
+	else
+		queue_flag_set(QUEUE_FLAG_NOWBCFLUSH , q);
+	spin_unlock_irq(q->queue_lock);
+
+	return ret;
+}
+
  static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
  {
  	return queue_var_show(blk_queue_nomerges(q), page);
@@ -258,6 +279,12 @@ static struct queue_sysfs_entry queue_no
  	.store = queue_nonrot_store,
  };

+static struct queue_sysfs_entry queue_wbcflush_entry = {
+	.attr = {.name = "wbcflush", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wbcflush_show,
+	.store = queue_wbcflush_store,
+};
+
  static struct queue_sysfs_entry queue_nomerges_entry = {
  	.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
  	.show = queue_nomerges_show,
@@ -284,6 +311,7 @@ static struct attribute *default_attrs[]
  	&queue_iosched_entry.attr,
  	&queue_hw_sector_size_entry.attr,
  	&queue_nonrot_entry.attr,
+	&queue_wbcflush_entry.attr,
  	&queue_nomerges_entry.attr,
  	&queue_rq_affinity_entry.attr,
  	&queue_iostats_entry.attr,
diff -urNp linux-2.6.29-orig/include/linux/blkdev.h linux-2.6.29/include/linux/blkdev.h
--- linux-2.6.29-orig/include/linux/blkdev.h	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/include/linux/blkdev.h	2009-03-30 10:23:34.000000000 +0900
@@ -452,6 +452,7 @@ struct request_queue
  #define QUEUE_FLAG_NONROT      14	/* non-rotational device (SSD) */
  #define QUEUE_FLAG_VIRT        QUEUE_FLAG_NONROT /* paravirt device */
  #define QUEUE_FLAG_IO_STAT     15	/* do IO stats */
+#define QUEUE_FLAG_NOWBCFLUSH  16	/* disable write-back cache flushing */

  #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
  				 (1 << QUEUE_FLAG_CLUSTER) |		\
@@ -572,6 +573,8 @@ enum {
  #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
  #define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
  #define blk_queue_nonrot(q)	test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
+#define blk_queue_nowbcflush(q)	\
+	test_bit(QUEUE_FLAG_NOWBCFLUSH, &(q)->queue_flags)
  #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
  #define blk_queue_flushing(q)	((q)->ordseq)
  #define blk_queue_stackable(q)	\

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/5] ext3: call blkdev_issue_flush on fsync()
  2009-03-30  1:51                                             ` Jeff Garzik
@ 2009-03-30  2:50                                               ` Fernando Luis Vázquez Cao
  2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30  2:50 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List

Jeff Garzik wrote:
> Fernando Luis Vázquez Cao wrote:
>> To ensure that bits are truly on-disk after an fsync or fdatasync, we
>> should force a disk flush explicitly when there is dirty data/metadata
>> and the journal didn't emit a write barrier (either because metadata is
>> not being synched or barriers are disabled).
>>
>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> 
> Your patches do not seem to propagate the issue-flush error code, even 
> when it is easily available.

Oops... you are right. I will fix that.

Thanks!

- Fernando


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29 23:14                                                                             ` Dave Chinner
  2009-03-30  0:39                                                                               ` Theodore Tso
@ 2009-03-30  3:01                                                                               ` Mark Lord
  2009-03-30  6:41                                                                                 ` Andreas T.Auer
  2009-03-30 12:55                                                                               ` Chris Mason
  2 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-30  3:01 UTC (permalink / raw)
  To: Stefan Richter, Jeff Garzik, Linus Torvalds, Matthew Garrett,
	Alan Cox, Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Dave Chinner wrote:
> On Sat, Mar 28, 2009 at 11:17:08AM -0400, Mark Lord wrote:
>> The better solution seems to be the rather obvious one:
>>
>>   the filesystem should commit data to disk before altering metadata.
> 
> Generalities are bad. For example:
> 
> write();
> unlink();
> <do more stuff>
> close();
> 
> This is a clear case where you want metadata changed before data is
> committed to disk. In many cases, you don't even want the data to
> hit the disk here.
..

Err, no actually.  I want a consistent disk state,
either all old or all new data after a crash.

Not loss of BOTH new and old data.

And the example above is trying to show, what??
Looks like a temporary file case, except the code
is buggy and should be doing the unlink() before
the write() call.
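
The unlink-before-write ordering for a temporary file can be sketched
roughly as follows. This is only an illustration, not code from this
thread: open_scratch_file() and its template argument are hypothetical
names, and mkstemp() is assumed to be an acceptable way to create the
file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create an anonymous scratch file: the name is removed right after
 * creation, so nothing written later is ever reachable by path.
 * tmpl must be a writable mkstemp() template ending in "XXXXXX".
 * Returns an open descriptor, or -1 on error.
 */
int open_scratch_file(char *tmpl)
{
	assert(tmpl != NULL);

	int fd = mkstemp(tmpl);		/* creates and opens a unique file */
	if (fd < 0)
		return -1;

	if (unlink(tmpl) != 0) {	/* drop the name before any write() */
		close(fd);
		return -1;
	}
	return fd;			/* blocks are released on close() */
}
```

With this ordering a crash can at worst leave an orphaned inode for
fsck or journal recovery to clean up; no named file ever refers to the
scratch data.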

But thanks for looking at this stuff!

* Re: Linux 2.6.29
  2009-03-30  1:29                                                                                 ` Trenton Adams
@ 2009-03-30  3:28                                                                                   ` Theodore Tso
  2009-03-30  3:55                                                                                     ` Trenton D. Adams
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-30  3:28 UTC (permalink / raw)
  To: Trenton Adams
  Cc: Mark Lord, Stefan Richter, Jeff Garzik, Linus Torvalds,
	Matthew Garrett, Alan Cox, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Sun, Mar 29, 2009 at 07:29:09PM -0600, Trenton Adams wrote:
> I am slightly confused by the "data=ordered" thing that everyone is
> mentioning of late.  In theory, it made sense to me before I tried it.
>  I switched to mounting my ext3 as ext4, and I'm still seeing
> seriously delayed fsyncs.  Theodore, I used a modified version of your
> fsync-tester.c to bench 1M writes, while doing a dd, and I'm still
> getting *almost* as bad of "fsync" performance as I was on ext3.  On
> ext3, the fsync would usually not finish until the dd was complete.

How much memory do you have?  On my 4gig X61 laptop, using a 5400 rpm
laptop drive, I see typical times of 1 to 1.5 seconds, with a few
outliers at 4-5 seconds.  With ext3, the fsync times immediately
jumped up to 6-8 seconds, with the outliers in the 13-15 second range.

(This is with a filesystem formatted as ext3, and mounted as either
ext3 or ext4; if the filesystem is formatted using "mke2fs -t ext4",
what you see is a very smooth 1.2-1.5 seconds fsync latency; indirect
blocks for very big files end up being quite inefficient.)

So I'm seeing a definite difference --- but also please remember that
"dd if=/dev/zero of=bigzero.img" really is an unfair, worst-case
scenario, since you are dirtying memory as fast as your CPU will dirty
pages.  Normally, even if you are running distcc, the rate at which
you can dirty pages will be throttled at your local network speed.

You might want to try more normal workloads and see whether you are
seeing distinct fsync latency differences with ext4.  Even with the
worst-case dd if=/dev/zero, I'm seeing major differences in my
testing.
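
For anyone who wants to reproduce this kind of measurement, a minimal
timing sketch might look like the following. It is not fsync-tester.c;
timed_fsync_write() is a hypothetical helper that writes a buffer and
times only the fsync() call, to be run while a competing workload such
as the dd above is active.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Write len bytes to fd, then time only the fsync() call.
 * Returns the fsync latency in seconds, or -1.0 on error.
 */
double timed_fsync_write(int fd, size_t len)
{
	assert(fd >= 0 && len > 0);

	char *buf = malloc(len);
	if (buf == NULL)
		return -1.0;
	memset(buf, 'x', len);

	size_t done = 0;
	while (done < len) {		/* handle short writes */
		ssize_t n = write(fd, buf + done, len - done);
		if (n <= 0) {
			free(buf);
			return -1.0;
		}
		done += (size_t)n;
	}
	free(buf);

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (fsync(fd) != 0)
		return -1.0;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it in a loop with a 1M buffer and printing each result gives a
rough picture of typical latency and outliers; the numbers depend
heavily on RAM size, the drive, and the background load.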

							- Ted

* Re: Linux 2.6.29
  2009-03-30  3:28                                                                                   ` Theodore Tso
@ 2009-03-30  3:55                                                                                     ` Trenton D. Adams
  2009-03-30 13:45                                                                                       ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Trenton D. Adams @ 2009-03-30  3:55 UTC (permalink / raw)
  To: Theodore Tso, Trenton Adams, Mark Lord, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Alan Cox,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Sun, Mar 29, 2009 at 9:28 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Sun, Mar 29, 2009 at 07:29:09PM -0600, Trenton Adams wrote:
>> I am slightly confused by the "data=ordered" thing that everyone is
>> mentioning of late.  In theory, it made sense to me before I tried it.
>>  I switched to mounting my ext3 as ext4, and I'm still seeing
>> seriously delayed fsyncs.  Theodore, I used a modified version of your
>> fsync-tester.c to bench 1M writes, while doing a dd, and I'm still
>> getting *almost* as bad of "fsync" performance as I was on ext3.  On
>> ext3, the fsync would usually not finish until the dd was complete.
>
> How much memory do you have?  On my 4gig X61 laptop, using a 5400 rpm
> laptop drive, I see typical times of 1 to 1.5 seconds, with a few
> outliers at 4-5 seconds.  With ext3, the fsync times immediately
> jumped up to 6-8 seconds, with the outliers in the 13-15 second range.

2G, and I believe 5400rpm

>
> (This is with a filesystem formated as ext3, and mounted as either
> ext3 or ext4; if the filesystem is formatted using "mke2fs -t ext4",
> what you see is a very smooth 1.2-1.5 seconds fsync latency, indirect
> blocks for very big files end up being quite inefficient.)

Oh.  I thought I had read somewhere that mounting ext4 over ext3 would
solve the problem.  Not sure where I read that now.  Sorry for wasting
your time.

>
> So I'm seeing a definite difference --- but also please remember that
> "dd if=/dev/zero of=bigzero.img" really is an unfair, worst-case
> scenario, since you are dirtying memory as fast as your CPU will dirty
> pages.  Normally, even if you are running distcc, the rate at which
> you can dirty pages will be throttled at your local network speed.

Yes, I realize that.  When trying to find performance problems I try
to be as *unfair* as possible. :D

Thanks Ted.

* Re: Linux 2.6.29
  2009-03-30  0:39                                                                               ` Theodore Tso
  2009-03-30  1:29                                                                                 ` Trenton Adams
@ 2009-03-30  6:31                                                                                 ` Dave Chinner
  2009-03-30 13:55                                                                                   ` Theodore Tso
  2009-03-30  7:13                                                                                 ` Andreas T.Auer
  2 siblings, 1 reply; 664+ messages in thread
From: Dave Chinner @ 2009-03-30  6:31 UTC (permalink / raw)
  To: Theodore Tso, Mark Lord, Stefan Richter, Jeff Garzik,
	Linus Torvalds, Matthew Garrett, Alan Cox, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Sun, Mar 29, 2009 at 08:39:48PM -0400, Theodore Tso wrote:
> On Mon, Mar 30, 2009 at 10:14:51AM +1100, Dave Chinner wrote:
> > This is a clear case where you want metadata changed before data is
> > committed to disk. In many cases, you don't even want the data to
> > hit the disk here.
> > 
> > Similarly, rsync does the magic open,write,close,rename sequence
> > without an fsync before the rename. And it doesn't need the fsync,
> > either. The proposed implicit fsync on rename will kill rsync
> > performance, and I think that may make many people unhappy....
> 
> I agree.  But unfortunately, I think we're going to be bullied into
> data=ordered semantics for the open/write/close/rename sequence, at
> least as the default.  Ext4 has a noauto_da_alloc mount option (which
> Eric Sandeen suggested we rename to "no_pony" :-), for people who
> mostly run sane applications that use fsync().
>
> For people who care about rsync's performance and who assume that they
> can always restart rsync if the system crashes while the rsync is
> running could, rsync could add Yet Another Rsync Option :-) which
> explicitly unlinks the target file before the rename, which would
> disable the implicit fsync().

Pardon my french, but that is a fucking joke.

You are making a judgement call that one application is more
important than another application and trying to impose that on
everyone. You are saying that we should perturb a well designed and
written backup application that is embedded into critical scripts
all around the world for the sake of a desktop application that has
developers that are too fucking lazy to fix their bugs.

If you want to trade rsync performance for desktop performance, do
it in the filesystem that is aimed at the desktop. Don't fuck rename
up for filesystems that are aimed at the server market and don't
want to implement performance sucking hacks to work around fucked up
desktop applications.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Linux 2.6.29
  2009-03-30  3:01                                                                               ` Mark Lord
@ 2009-03-30  6:41                                                                                 ` Andreas T.Auer
  0 siblings, 0 replies; 664+ messages in thread
From: Andreas T.Auer @ 2009-03-30  6:41 UTC (permalink / raw)
  To: Mark Lord
  Cc: Stefan Richter, Jeff Garzik, Linus Torvalds, Matthew Garrett,
	Alan Cox, Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On 30.03.2009 05:01 Mark Lord wrote:
> Dave Chinner wrote:
>> On Sat, Mar 28, 2009 at 11:17:08AM -0400, Mark Lord wrote:
>>> The better solution seems to be the rather obvious one:
>>>
>>>   the filesystem should commit data to disk before altering metadata.
>>
>> Generalities are bad. For example:
>>
>> write();
>> unlink();
>> <do more stuff>
>> close();
>>
>> This is a clear case where you want metadata changed before data is
>> committed to disk. In many cases, you don't even want the data to
>> hit the disk here.
> ..
>
> Err, no actually.  I want a consistent disk state,
> either all old or all new data after a crash.
>
>

Dave is right that if you write to a file and then unlink the same file,
the data are orphaned. In that case you don't want the orphaned
data to be written to disk. But Mark is right, too, because in that case
you probably also don't want any metadata to be written to the disk,
unless the open() was already committed. You might have to update
timestamps for the directory.
So rephrasing it:
The filesystem should not alter the metadata before writing the _linked_
data.



* Re: Linux 2.6.29
  2009-03-30  0:39                                                                               ` Theodore Tso
  2009-03-30  1:29                                                                                 ` Trenton Adams
  2009-03-30  6:31                                                                                 ` Dave Chinner
@ 2009-03-30  7:13                                                                                 ` Andreas T.Auer
  2009-03-30  9:05                                                                                   ` Alan Cox
  2009-03-30 19:02                                                                                   ` Bill Davidsen
  2 siblings, 2 replies; 664+ messages in thread
From: Andreas T.Auer @ 2009-03-30  7:13 UTC (permalink / raw)
  To: Theodore Tso, Mark Lord, Stefan Richter, Jeff Garzik,
	Linus Torvalds, Matthew Garrett, Alan Cox, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List


On 30.03.2009 02:39 Theodore Tso wrote:
> All I can do is apologize to all other filesystem developers profusely
> for ext3's data=ordered semantics; at this point, I very much regret
> that we made data=ordered the default for ext3.  But the application
> writers vastly outnumber us, and realistically we're not going to be
> able to easily roll back eight years of application writers being
> trained that fsync() is not necessary, and actually is detrimental for
> ext3.
>
>   
It seems you still didn't get the point. ext3 data=ordered is not the
problem. The problem is that the average developer doesn't expect the fs
to _re-order_ stuff. This is how most common filesystems worked long before
ext3 was introduced. Developers just know that there is caching and that
they might lose recent data, but they expect the fs on disk to be a
snapshot of the fs in memory at some time before the crash (except when
crashing while writing). But the re-ordering brings it to a state that
was never in memory. data=ordered just reflects this thinking.
With data=writeback as the default, users would have lost data and
would have simply chosen a different fs instead of tweaking the params.
Or the distros would have made data=ordered the default to avoid
being blamed for the data loss.

And I still don't know of any reason why it makes sense to write the
metadata for not-yet-written data immediately instead of delaying that, too.


* Re: Linux 2.6.29
  2009-03-30  7:13                                                                                 ` Andreas T.Auer
@ 2009-03-30  9:05                                                                                   ` Alan Cox
  2009-03-30 10:49                                                                                     ` Andreas T.Auer
  2009-03-30 19:02                                                                                   ` Bill Davidsen
  1 sibling, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-30  9:05 UTC (permalink / raw)
  To: Andreas T.Auer
  Cc: Theodore Tso, Mark Lord, Stefan Richter, Jeff Garzik,
	Linus Torvalds, Matthew Garrett, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

> It seems you still didn't get the point. ext3 data=ordered is not the
> problem. The problem is that the average developer doesn't expect the fs
> to _re-order_ stuff. This is how most common fs did work long before

No it isn't. Standard Unix file systems made no such guarantee and would
write out data out of order. The disk scheduler would then further
re-order things.

If you think the "guarantees" from before ext3 are normal defaults you've
been writing junk code


* Re: Linux 2.6.29
  2009-03-30 22:07           ` Arjan van de Ven
@ 2009-03-30 10:18             ` Pavel Machek
  2009-03-31 13:33             ` Rafael J. Wysocki
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 664+ messages in thread
From: Pavel Machek @ 2009-03-30 10:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Hans-Peter Jansen, Linus Torvalds, Mike Galbraith,
	Geert Uytterhoeven, linux-kernel

Hi!

>> I always wonder, why Arjan does not intervene for his kerneloops.org  
>> project, since your approach opens a window of uncertainty during the 
>> merge window when simply using git as an efficient fetch tool.
>
> I would *love* it if Linus would, as first commit mark his tree as "-git0"
> (as per snapshots) or "-rc0". So that I can split the "final" versus
> "merge window" oopses.

Pretty please... I keep kernel binaries around and being able to tell
 what it is when it boots is useful. 
	
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Linux 2.6.29
  2009-03-30  9:05                                                                                   ` Alan Cox
@ 2009-03-30 10:49                                                                                     ` Andreas T.Auer
  2009-03-30 11:12                                                                                       ` Alan Cox
  2009-03-30 11:17                                                                                       ` Ric Wheeler
  0 siblings, 2 replies; 664+ messages in thread
From: Andreas T.Auer @ 2009-03-30 10:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andreas T.Auer, Theodore Tso, Mark Lord, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On 30.03.2009 11:05 Alan Cox wrote:
>> It seems you still didn't get the point. ext3 data=ordered is not the
>> problem. The problem is that the average developer doesn't expect the fs
>> to _re-order_ stuff. This is how most common fs did work long before
>>     
>
> No it isn´t. Standard Unix file systems made no such guarantee and would
> write out data out of order. The disk scheduler would then further
> re-order things.
>
>   
You surely know that better: Did fs actually write "later" data quite
long before "earlier" data? During the flush data may be re-ordered, but
was it also _done_ outside of it?

> If you think the ¨guarantees¨ from before ext3 are normal defaults you´ve
> been writing junk code
>
>   
I've been on ReiserFS ever since it was considered stable in some SuSE 7.x.
And I expected it to be fairly ordered, but as a network protocol
programmer I haven't relied on the ordering of fs write-outs so far.

* Re: Linux 2.6.29
  2009-03-30 10:49                                                                                     ` Andreas T.Auer
@ 2009-03-30 11:12                                                                                       ` Alan Cox
  2009-03-30 11:17                                                                                       ` Ric Wheeler
  1 sibling, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-30 11:12 UTC (permalink / raw)
  To: Andreas T.Auer
  Cc: Andreas T.Auer, Theodore Tso, Mark Lord, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

> You surely know that better: Did fs actually write "later" data quite
> long before "earlier" data? During the flush data may be re-ordered, but
> was it also _done_ outside of it?

BSD FFS/UFS and earlier file systems could leave you with all sorts of
ordering that was not guaranteed - you did get data written within about
30 seconds but no order guarantees and a crash/fsck could give you
interesting partial updates .. really interesting.

renaming was one fairly safe case as BSD FFS/UFS did rename synchronously
for the most part.


* Re: Linux 2.6.29
  2009-03-30 10:49                                                                                     ` Andreas T.Auer
  2009-03-30 11:12                                                                                       ` Alan Cox
@ 2009-03-30 11:17                                                                                       ` Ric Wheeler
  2009-03-30 13:48                                                                                         ` Mark Lord
  2009-03-30 15:34                                                                                         ` Linus Torvalds
  1 sibling, 2 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-30 11:17 UTC (permalink / raw)
  To: Andreas T.Auer
  Cc: Alan Cox, Theodore Tso, Mark Lord, Stefan Richter, Jeff Garzik,
	Linus Torvalds, Matthew Garrett, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Andreas T.Auer wrote:
> On 30.03.2009 11:05 Alan Cox wrote:
>   
>>> It seems you still didn't get the point. ext3 data=ordered is not the
>>> problem. The problem is that the average developer doesn't expect the fs
>>> to _re-order_ stuff. This is how most common fs did work long before
>>>     
>>>       
>> No it isn´t. Standard Unix file systems made no such guarantee and would
>> write out data out of order. The disk scheduler would then further
>> re-order things.
>>
>>   
>>     
> You surely know that better: Did fs actually write "later" data quite
> long before "earlier" data? During the flush data may be re-ordered, but
> was it also _done_ outside of it?
>   

People keep forgetting that storage (even your commodity S-ATA class
of drives) has a very large, volatile cache. The disk firmware can hold
writes in that cache as long as it wants, reorder them into anything
that makes sense to it, and makes no explicit ordering promises.

This is where the write barrier code comes in - for file systems that 
care about ordering for data, we use barrier ops to impose the required 
ordering.

In a similar way, fsync() gives applications the power to impose their 
own ordering.
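
As a rough illustration of an application imposing its own ordering,
the open/write/fsync/close/rename sequence discussed in this thread
could be sketched like this. The function name replace_file_durably()
and the ".tmp" suffix are assumptions made for the example, not
anything taken from an existing application.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace path with len bytes from buf so that, after a crash, the
 * name refers to either the complete old or the complete new contents.
 * Returns 0 on success, -1 on error.
 */
int replace_file_durably(const char *path, const void *buf, size_t len)
{
	assert(path != NULL && (buf != NULL || len == 0));

	char tmp[4096];
	int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	const char *p = buf;
	size_t done = 0;
	while (done < len) {		/* handle short writes */
		ssize_t w = write(fd, p + done, len - done);
		if (w <= 0)
			goto fail;
		done += (size_t)w;
	}

	if (fsync(fd) != 0)		/* data reaches disk before the rename */
		goto fail;
	if (close(fd) != 0) {
		unlink(tmp);
		return -1;
	}
	if (rename(tmp, path) != 0) {	/* atomically switch the name */
		unlink(tmp);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp);
	return -1;
}
```

The fsync() before rename() is the step the thread is arguing about:
leaving it out is faster, but then the result after a crash depends on
how the filesystem orders data and metadata writes.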

If we assume that we can "save" an fsync cost with ordering mode, we 
have to keep in mind that the file system will need to do the expensive 
cache flushes in order to preserve its internal ordering.
>   
>> If you think the ¨guarantees¨ from before ext3 are normal defaults you´ve
>> been writing junk code
>>
>>   
>>     
> I'm still on ReiserFS since it was considered stable in some SuSE 7.x.
> And I expected it to be fairly ordered, but as a network protocol
> programmer I didn't rely on the ordering of fs write-outs yet.
>   

With reiserfs, you will have barriers on by default in SLES/opensuse 
which will keep (at least fs meta-data) properly ordered....

ric


* Re: [PATCH 2/5] ext3: call blkdev_issue_flush on fsync()
  2009-03-30  2:50                                               ` Fernando Luis Vázquez Cao
@ 2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
  2009-03-30 12:09                                                   ` [PATCH 1/7] block: Add block_flush_device() Fernando Luis Vázquez Cao
                                                                     ` (6 more replies)
  0 siblings, 7 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 12:04 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

Fernando Luis Vázquez Cao wrote:
> Jeff Garzik wrote:
>> Fernando Luis Vázquez Cao wrote:
>>> To ensure that bits are truly on-disk after an fsync or fdatasync, we
>>> should force a disk flush explicitly when there is dirty data/metadata
>>> and the journal didn't emit a write barrier (either because metadata is
>>> not being synched or barriers are disabled).
>>>
>>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>
>> Your patches do not seem to propagate the issue-flush error code, even 
>> when it is easily available.
> 
> Oops... you are right. I will fix that.

I reflected your comments in the new version of the patch set. While at
it, I also modified the respective reiserfs and xfs fsync functions so
that, at least to some extent, they propagate the issue-flush error code.

I'll be replying to this email with the new patches.

Thanks,

Fernando

* [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
@ 2009-03-30 12:09                                                   ` Fernando Luis Vázquez Cao
  2009-03-30 15:07                                                     ` Bartlomiej Zolnierkiewicz
  2009-03-30 17:34                                                     ` Linus Torvalds
  2009-03-30 12:11                                                   ` [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync() Fernando Luis Vázquez Cao
                                                                     ` (5 subsequent siblings)
  6 siblings, 2 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 12:09 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

This patch adds a helper function that should be used by filesystems that need
to flush the underlying block device on fsync()/fdatasync().

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/buffer.c linux-2.6.29/fs/buffer.c
--- linux-2.6.29-orig/fs/buffer.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/buffer.c	2009-03-30 15:27:04.000000000 +0900
@@ -165,6 +165,17 @@ void end_buffer_write_sync(struct buffer
  	put_bh(bh);
  }

+/* Issue flush of write caches on the block device */
+int block_flush_device(struct block_device *bdev)
+{
+	int ret = 0;
+
+	ret = blkdev_issue_flush(bdev, NULL);
+
+	return (ret == -EOPNOTSUPP) ? 0 : ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
  /*
   * Write out and wait upon all the dirty data associated with a block
   * device via its mapping.  Does not take the superblock lock.
diff -urNp linux-2.6.29-orig/include/linux/buffer_head.h linux-2.6.29/include/linux/buffer_head.h
--- linux-2.6.29-orig/include/linux/buffer_head.h	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/include/linux/buffer_head.h	2009-03-30 15:27:26.000000000 +0900
@@ -238,6 +238,7 @@ int nobh_write_end(struct file *, struct
  int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
  int nobh_writepage(struct page *page, get_block_t *get_block,
                          struct writeback_control *wbc);
+int block_flush_device(struct block_device *bdev);

  void buffer_init(void);
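
The error handling in block_flush_device() above can be modeled in
userspace as a small sketch. normalize_flush_error() is a hypothetical
stand-in written only for illustration; it does not call the kernel's
blkdev_issue_flush() and just mirrors the return-value mapping in the
patch.

```c
#include <errno.h>

/*
 * Map the result of a cache-flush request the way the helper in the
 * patch does: a device that cannot flush (-EOPNOTSUPP) is treated as
 * success, while every other error is passed through to the caller.
 */
int normalize_flush_error(int ret)
{
	return (ret == -EOPNOTSUPP) ? 0 : ret;
}
```

The design choice is that fsync() should not start failing on devices
that have no flush command, while real I/O errors such as -EIO still
reach the application.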


* [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
  2009-03-30 12:09                                                   ` [PATCH 1/7] block: Add block_flush_device() Fernando Luis Vázquez Cao
@ 2009-03-30 12:11                                                   ` Fernando Luis Vázquez Cao
  2009-03-30 14:04                                                     ` Theodore Tso
  2009-03-30 12:15                                                   ` [PATCH 3/7] ext4: " Fernando Luis Vázquez Cao
                                                                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 12:11 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/ext3/fsync.c linux-2.6.29/fs/ext3/fsync.c
--- linux-2.6.29-orig/fs/ext3/fsync.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/ext3/fsync.c	2009-03-30 15:31:34.000000000 +0900
@@ -45,6 +45,8 @@
  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
  {
  	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
  	int ret = 0;

  	J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +71,30 @@ int ext3_sync_file(struct file * file, s
  	 */
  	if (ext3_should_journal_data(inode)) {
  		ret = ext3_force_commit(inode->i_sb);
-		goto out;
+		if (!ret && !(journal->j_flags & JFS_BARRIER))
+			ret = block_flush_device(inode->i_sb->s_bdev);
+		return ret;
  	}

-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+		if (i_state & I_DIRTY_PAGES)
+			ret = block_flush_device(inode->i_sb->s_bdev);
+		return ret;
+	}

  	/*
  	 * The VFS has written the file data.  If the inode is unaltered
  	 * then we need not start a commit.
  	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
  		struct writeback_control wbc = {
  			.sync_mode = WB_SYNC_ALL,
  			.nr_to_write = 0, /* sys_fsync did this */
  		};
  		ret = sync_inode(inode, &wbc);
+		if (!ret && journal && !(journal->j_flags & JFS_BARRIER))
+			ret = block_flush_device(inode->i_sb->s_bdev);
  	}
-out:
+
  	return ret;
  }

* [PATCH 3/7] ext4: call blkdev_issue_flush() on fsync()
  2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
  2009-03-30 12:09                                                   ` [PATCH 1/7] block: Add block_flush_device() Fernando Luis Vázquez Cao
  2009-03-30 12:11                                                   ` [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync() Fernando Luis Vázquez Cao
@ 2009-03-30 12:15                                                   ` Fernando Luis Vázquez Cao
  2009-03-30 12:18                                                   ` [PATCH 4/7] vfs: call blkdev_issue_flush() from generic file_fsync() helper Fernando Luis Vázquez Cao
                                                                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 12:15 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/ext4/fsync.c linux-2.6.29/fs/ext4/fsync.c
--- linux-2.6.29-orig/fs/ext4/fsync.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/ext4/fsync.c	2009-03-30 15:35:26.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
  {
  	struct inode *inode = dentry->d_inode;
  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
  	int ret = 0;

  	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,30 @@ int ext4_sync_file(struct file *file, st
  	 */
  	if (ext4_should_journal_data(inode)) {
  		ret = ext4_force_commit(inode->i_sb);
-		goto out;
+		if (!ret && !(journal->j_flags & JBD2_BARRIER))
+			ret = block_flush_device(inode->i_sb->s_bdev);
+		return ret;
  	}

-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+		if (i_state & I_DIRTY_PAGES)
+			ret = block_flush_device(inode->i_sb->s_bdev);
+		return ret;
+	}

  	/*
  	 * The VFS has written the file data.  If the inode is unaltered
  	 * then we need not start a commit.
  	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
  		struct writeback_control wbc = {
  			.sync_mode = WB_SYNC_ALL,
  			.nr_to_write = 0, /* sys_fsync did this */
  		};
  		ret = sync_inode(inode, &wbc);
-		if (journal && (journal->j_flags & JBD2_BARRIER))
-			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+		if (!ret && journal && !(journal->j_flags & JBD2_BARRIER))
+			ret = block_flush_device(inode->i_sb->s_bdev);
  	}
-out:
+
  	return ret;
  }

^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 4/7] vfs: call blkdev_issue_flush() from generic file_fsync() helper
  2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
                                                                     ` (2 preceding siblings ...)
  2009-03-30 12:15                                                   ` [PATCH 3/7] ext4: " Fernando Luis Vázquez Cao
@ 2009-03-30 12:18                                                   ` Fernando Luis Vázquez Cao
  2009-03-30 12:22                                                   ` [PATCH 5/7] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes Fernando Luis Vázquez Cao
                                                                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 12:18 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

To ensure that bits are truly on-disk after an fsync or fdatasync
we should force a disk flush explicitly. This is necessary to
have data integrity guarantees in filesystems such as FAT which
do not provide their own fsync implementation and use the vfs
helper instead.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/sync.c linux-2.6.29/fs/sync.c
--- linux-2.6.29-orig/fs/sync.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/sync.c	2009-03-30 15:43:59.000000000 +0900
@@ -72,6 +72,11 @@ int file_fsync(struct file *filp, struct
  	err = sync_blockdev(sb->s_bdev);
  	if (!ret)
  		ret = err;
+
+	err = block_flush_device(sb->s_bdev);
+	if (!ret)
+		ret = err;
+
  	return ret;
  }
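
The "if (!ret) ret = err;" idiom used in the hunk above can be modelled in userspace. This is only a sketch, not kernel code: step_ctx, the step_* functions, and both run_* helpers are hypothetical names made up for the illustration. The first helper keeps going after a failure and reports the first error, as file_fsync() does here; the second stops at the first failure, which is the shape the "if (!error && ...)" checks take in the xfs patch of this series.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical context that just counts how many steps actually ran. */
struct step_ctx {
	unsigned int calls;
};

/* Hypothetical step type: returns 0 or a negative errno value. */
typedef int (*sync_step_fn)(struct step_ctx *ctx);

static int step_ok(struct step_ctx *ctx)     { ctx->calls++; return 0; }
static int step_eio(struct step_ctx *ctx)    { ctx->calls++; return -EIO; }
static int step_enospc(struct step_ctx *ctx) { ctx->calls++; return -ENOSPC; }

/* Run every step; later steps still run, but only the first error is kept. */
static int run_all_keep_first_error(const sync_step_fn *steps, size_t n,
				    struct step_ctx *ctx)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		int err = steps[i](ctx);

		if (!ret)
			ret = err;
	}
	return ret;
}

/* Stop at the first failing step; later steps are skipped. */
static int run_until_first_error(const sync_step_fn *steps, size_t n,
				 struct step_ctx *ctx)
{
	int ret = 0;

	for (size_t i = 0; i < n && !ret; i++)
		ret = steps[i](ctx);
	return ret;
}
```

With the steps { step_ok, step_eio, step_enospc }, both helpers return -EIO, but the first one runs all three steps while the second runs only two.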


^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
                                                                     ` (3 preceding siblings ...)
  2009-03-30 12:18                                                   ` [PATCH 4/7] vfs: call blkdev_issue_flush() from generic file_fsync() helper Fernando Luis Vázquez Cao
@ 2009-03-30 12:22                                                   ` Fernando Luis Vázquez Cao
  2009-03-30 12:36                                                     ` Jens Axboe
  2009-03-30 15:14                                                     ` Bartlomiej Zolnierkiewicz
  2009-03-30 12:33                                                   ` [PATCH 6/7] xfs: propagate issue-flush error code Fernando Luis Vázquez Cao
  2009-03-30 12:36                                                   ` [PATCH 7/7] reiserfs: " Fernando Luis Vázquez Cao
  6 siblings, 2 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 12:22 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

Add a sysfs knob to disable storage device writeback cache flushes.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
--- linux-2.6.29-orig/block/blk-barrier.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-barrier.c	2009-03-30 17:08:28.000000000 +0900
@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
  	if (!q)
  		return -ENXIO;

+	if (blk_queue_nowbcflush(q))
+		return -EOPNOTSUPP;
+
  	bio = bio_alloc(GFP_KERNEL, 0);
  	if (!bio)
  		return -ENOMEM;
diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
--- linux-2.6.29-orig/block/blk-core.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-core.c	2009-03-30 17:08:28.000000000 +0900
@@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
  			goto end_io;
  		}
  		if (bio_barrier(bio) && bio_has_data(bio) &&
-		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
+		    (blk_queue_nowbcflush(q) ||
+		     q->next_ordered == QUEUE_ORDERED_NONE)) {
  			err = -EOPNOTSUPP;
  			goto end_io;
  		}
diff -urNp linux-2.6.29-orig/block/blk-sysfs.c linux-2.6.29/block/blk-sysfs.c
--- linux-2.6.29-orig/block/blk-sysfs.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/block/blk-sysfs.c	2009-03-30 17:08:28.000000000 +0900
@@ -151,6 +151,27 @@ static ssize_t queue_nonrot_store(struct
  	return ret;
  }

+static ssize_t queue_wbcflush_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(!blk_queue_nowbcflush(q), page);
+}
+
+static ssize_t queue_wbcflush_store(struct request_queue *q, const char *page,
+				    size_t count)
+{
+	unsigned long nm;
+	ssize_t ret = queue_var_store(&nm, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	if (nm)
+		queue_flag_clear(QUEUE_FLAG_NOWBCFLUSH , q);
+	else
+		queue_flag_set(QUEUE_FLAG_NOWBCFLUSH , q);
+	spin_unlock_irq(q->queue_lock);
+
+	return ret;
+}
+
  static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
  {
  	return queue_var_show(blk_queue_nomerges(q), page);
@@ -258,6 +279,12 @@ static struct queue_sysfs_entry queue_no
  	.store = queue_nonrot_store,
  };

+static struct queue_sysfs_entry queue_wbcflush_entry = {
+	.attr = {.name = "wbcflush", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wbcflush_show,
+	.store = queue_wbcflush_store,
+};
+
  static struct queue_sysfs_entry queue_nomerges_entry = {
  	.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
  	.show = queue_nomerges_show,
@@ -284,6 +311,7 @@ static struct attribute *default_attrs[]
  	&queue_iosched_entry.attr,
  	&queue_hw_sector_size_entry.attr,
  	&queue_nonrot_entry.attr,
+	&queue_wbcflush_entry.attr,
  	&queue_nomerges_entry.attr,
  	&queue_rq_affinity_entry.attr,
  	&queue_iostats_entry.attr,
diff -urNp linux-2.6.29-orig/include/linux/blkdev.h linux-2.6.29/include/linux/blkdev.h
--- linux-2.6.29-orig/include/linux/blkdev.h	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/include/linux/blkdev.h	2009-03-30 17:08:28.000000000 +0900
@@ -452,6 +452,7 @@ struct request_queue
  #define QUEUE_FLAG_NONROT      14	/* non-rotational device (SSD) */
  #define QUEUE_FLAG_VIRT        QUEUE_FLAG_NONROT /* paravirt device */
  #define QUEUE_FLAG_IO_STAT     15	/* do IO stats */
+#define QUEUE_FLAG_NOWBCFLUSH  16	/* disable write-back cache flushing */

  #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
  				 (1 << QUEUE_FLAG_CLUSTER) |		\
@@ -572,6 +573,8 @@ enum {
  #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
  #define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
  #define blk_queue_nonrot(q)	test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
+#define blk_queue_nowbcflush(q)	\
+	test_bit(QUEUE_FLAG_NOWBCFLUSH, &(q)->queue_flags)
  #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
  #define blk_queue_flushing(q)	((q)->ordseq)
  #define blk_queue_stackable(q)	\

^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 6/7] xfs: propagate issue-flush error code
  2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
                                                                     ` (4 preceding siblings ...)
  2009-03-30 12:22                                                   ` [PATCH 5/7] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes Fernando Luis Vázquez Cao
@ 2009-03-30 12:33                                                   ` Fernando Luis Vázquez Cao
  2009-03-30 15:20                                                     ` Bartlomiej Zolnierkiewicz
  2009-03-31 23:37                                                     ` Dave Chinner
  2009-03-30 12:36                                                   ` [PATCH 7/7] reiserfs: " Fernando Luis Vázquez Cao
  6 siblings, 2 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 12:33 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj, bzolnier

blkdev_issue_flush() may fail (e.g. due to media error on FLUSH CACHE
command execution) so its users should check for the return value.

(This issue was first spotted by Bartlomiej Zolnierkiewicz)

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_buf.c linux-2.6.29/fs/xfs/linux-2.6/xfs_buf.c
--- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_buf.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/xfs/linux-2.6/xfs_buf.c	2009-03-30 14:44:34.000000000 +0900
@@ -1446,6 +1446,7 @@ xfs_free_buftarg(
  {
  	xfs_flush_buftarg(btp, 1);
  	if (mp->m_flags & XFS_MOUNT_BARRIER)
+		/* FIXME: check return value */
  		xfs_blkdev_issue_flush(btp);
  	xfs_free_bufhash(btp);
  	iput(btp->bt_mapping->host);
diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.c linux-2.6.29/fs/xfs/linux-2.6/xfs_super.c
--- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/xfs/linux-2.6/xfs_super.c	2009-03-30 15:16:42.000000000 +0900
@@ -721,11 +721,11 @@ xfs_mountfs_check_barriers(xfs_mount_t *
  	}
  }

-void
+int
  xfs_blkdev_issue_flush(
  	xfs_buftarg_t		*buftarg)
  {
-	blkdev_issue_flush(buftarg->bt_bdev, NULL);
+	return block_flush_device(buftarg->bt_bdev);
  }

  STATIC void
diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.h linux-2.6.29/fs/xfs/linux-2.6/xfs_super.h
--- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.h	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/xfs/linux-2.6/xfs_super.h	2009-03-30 14:46:31.000000000 +0900
@@ -89,7 +89,7 @@ struct block_device;

  extern __uint64_t xfs_max_file_offset(unsigned int);

-extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
+extern int xfs_blkdev_issue_flush(struct xfs_buftarg *);

  extern const struct export_operations xfs_export_operations;
  extern struct xattr_handler *xfs_xattr_handlers[];
diff -urNp linux-2.6.29-orig/fs/xfs/xfs_vnodeops.c linux-2.6.29/fs/xfs/xfs_vnodeops.c
--- linux-2.6.29-orig/fs/xfs/xfs_vnodeops.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/xfs/xfs_vnodeops.c	2009-03-30 15:08:21.000000000 +0900
@@ -678,20 +678,20 @@ xfs_fsync(
  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
  	}

-	if ((ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
+	if (!error && (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
  		/*
  		 * If the log write didn't issue an ordered tag we need
  		 * to flush the disk cache for the data device now.
  		 */
  		if (!log_flushed)
-			xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);
+			error = xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);

  		/*
  		 * If this inode is on the RT dev we need to flush that
  		 * cache as well.
  		 */
-		if (XFS_IS_REALTIME_INODE(ip))
-			xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);
+		if (!error && XFS_IS_REALTIME_INODE(ip))
+			error = xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);
  	}

  	return error;

^ permalink raw reply	[flat|nested] 664+ messages in thread

* [PATCH 7/7] reiserfs: propagate issue-flush error code
  2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
                                                                     ` (5 preceding siblings ...)
  2009-03-30 12:33                                                   ` [PATCH 6/7] xfs: propagate issue-flush error code Fernando Luis Vázquez Cao
@ 2009-03-30 12:36                                                   ` Fernando Luis Vázquez Cao
  2009-03-30 15:25                                                     ` Bartlomiej Zolnierkiewicz
  6 siblings, 1 reply; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 12:36 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj, bzolnier

blkdev_issue_flush() may fail (e.g. due to media error on FLUSH CACHE
command execution) so its users should check for the return value.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-orig/fs/reiserfs/file.c linux-2.6.29/fs/reiserfs/file.c
--- linux-2.6.29-orig/fs/reiserfs/file.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29/fs/reiserfs/file.c	2009-03-30 16:19:19.000000000 +0900
@@ -146,8 +146,9 @@ static int reiserfs_sync_file(struct fil
  	reiserfs_write_lock(p_s_inode->i_sb);
  	barrier_done = reiserfs_commit_for_inode(p_s_inode);
  	reiserfs_write_unlock(p_s_inode->i_sb);
-	if (barrier_done != 1 && reiserfs_barrier_flush(p_s_inode->i_sb))
-		blkdev_issue_flush(p_s_inode->i_sb->s_bdev, NULL);
+	if (!n_err && barrier_done != 1 &&
+			reiserfs_barrier_flush(p_s_inode->i_sb))
+		n_err = block_flush_device(p_s_inode->i_sb->s_bdev);
  	if (barrier_done < 0)
  		return barrier_done;
  	return (n_err < 0) ? -EIO : 0;

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device  writeback cache flushes
  2009-03-30 12:22                                                   ` [PATCH 5/7] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes Fernando Luis Vázquez Cao
@ 2009-03-30 12:36                                                     ` Jens Axboe
  2009-03-30 14:18                                                       ` Fernando Luis Vázquez Cao
  2009-03-30 15:14                                                     ` Bartlomiej Zolnierkiewicz
  1 sibling, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-30 12:36 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
> Add a sysfs knob to disable storage device writeback cache flushes.
>
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---
>
> diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
> --- linux-2.6.29-orig/block/blk-barrier.c	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/block/blk-barrier.c	2009-03-30 17:08:28.000000000 +0900
> @@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
>  	if (!q)
>  		return -ENXIO;
>
> +	if (blk_queue_nowbcflush(q))
> +		return -EOPNOTSUPP;
> +
>  	bio = bio_alloc(GFP_KERNEL, 0);
>  	if (!bio)
>  		return -ENOMEM;
> diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
> --- linux-2.6.29-orig/block/blk-core.c	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/block/blk-core.c	2009-03-30 17:08:28.000000000 +0900
> @@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
>  			goto end_io;
>  		}
>  		if (bio_barrier(bio) && bio_has_data(bio) &&
> -		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
> +		    (blk_queue_nowbcflush(q) ||
> +		     q->next_ordered == QUEUE_ORDERED_NONE)) {
>  			err = -EOPNOTSUPP;
>  			goto end_io;
>  		}

This (and the above hunk) should be changed. -EOPNOTSUPP means the
target does not support barriers, that is a different thing to flushes
not being needed. A file system issuing a barrier and getting
-EOPNOTSUPP back will disable barriers, since it now thinks that
ordering cannot be guaranteed.

A more appropriate change would be to successfully complete a flush
without actually sending it down to the device if blk_queue_nowbcflush()
is true. Then blkdev_issue_flush() would just work as well. It also
needs to take stacking into account, or stacked drivers will have to
propagate the settings up the stack. If you simply allow the barrier to
be passed down, you get that for free.
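
The difference between the two behaviours can be sketched in userspace. What follows is only a model, not kernel code: model_queue, model_fs and both flush functions are hypothetical stand-ins, and the assumption that a caller turns barriers off after seeing -EOPNOTSUPP is taken from the paragraph above.

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for a request queue; not the kernel's struct. */
struct model_queue {
	bool supports_barriers;		/* device can do ordered writes/flushes */
	bool no_wbc_flush;		/* admin declared cache flushes unnecessary */
	unsigned int flushes_sent;	/* flushes that reached the "device" */
};

/* Behaviour of the patch as posted: the knob reports "not supported". */
static int model_flush_as_posted(struct model_queue *q)
{
	if (q->no_wbc_flush || !q->supports_barriers)
		return -EOPNOTSUPP;
	q->flushes_sent++;
	return 0;
}

/* Behaviour suggested in the review: succeed without sending anything. */
static int model_flush_as_suggested(struct model_queue *q)
{
	if (!q->supports_barriers)
		return -EOPNOTSUPP;
	if (q->no_wbc_flush)
		return 0;
	q->flushes_sent++;
	return 0;
}

/* A filesystem-like caller that gives up on barriers after -EOPNOTSUPP. */
struct model_fs {
	bool barriers_enabled;
};

static void model_fs_commit(struct model_fs *fs, struct model_queue *q,
			    int (*flush)(struct model_queue *))
{
	if (fs->barriers_enabled && flush(q) == -EOPNOTSUPP)
		fs->barriers_enabled = false;
}
```

In this model, a queue with no_wbc_flush set makes the "as posted" variant switch the caller's barriers off permanently, while the "as suggested" variant leaves them on and simply sends no flush to the device.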

> +static struct queue_sysfs_entry queue_wbcflush_entry = {
> +	.attr = {.name = "wbcflush", .mode = S_IRUGO | S_IWUSR },
> +	.show = queue_wbcflush_show,
> +	.store = queue_wbcflush_store,
> +};
> +

Naming is also pretty bad, perhaps something like "honor_cache_flush"
would be better, or perhaps "cache_flush_needed". At least something
that is more descriptive of what this setting actually controls, wbcflush
does not do that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29 23:14                                                                             ` Dave Chinner
  2009-03-30  0:39                                                                               ` Theodore Tso
  2009-03-30  3:01                                                                               ` Mark Lord
@ 2009-03-30 12:55                                                                               ` Chris Mason
  2009-03-30 17:42                                                                                 ` Theodore Tso
  2009-03-31 23:55                                                                                 ` Dave Chinner
  2 siblings, 2 replies; 664+ messages in thread
From: Chris Mason @ 2009-03-30 12:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mark Lord, Stefan Richter, Jeff Garzik, Linus Torvalds,
	Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Mon, 2009-03-30 at 10:14 +1100, Dave Chinner wrote:
> On Sat, Mar 28, 2009 at 11:17:08AM -0400, Mark Lord wrote:
> > The better solution seems to be the rather obvious one:
> >
> >   the filesystem should commit data to disk before altering metadata.
> 
> Generalities are bad. For example:
> 
> write();
> unlink();
> <do more stuff>
> close();
> 
> This is a clear case where you want metadata changed before data is
> committed to disk. In many cases, you don't even want the data to
> hit the disk here.
> 
> Similarly, rsync does the magic open,write,close,rename sequence
> without an fsync before the rename. And it doesn't need the fsync,
> either. The proposed implicit fsync on rename will kill rsync
> performance, and I think that may make many people unhappy....
> 

Sorry, I'm afraid that rsync falls into the same category as the
kde/gnome apps here.

There are a lot of backup programs built around rsync, and every one of
them risks losing the old copy of the file by renaming an unflushed new
copy over it.

rsync needs the flushing about a million times more than gnome and kde,
and it doesn't have any option to do it automatically.  It does have the
option to create backups, which is how a percentage of people are using
it, but I wouldn't call its current setup safe outside of ext3.
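
For reference, the flush-before-rename sequence under discussion looks roughly like this from userspace. This is a minimal sketch and not rsync's code: replace_file_durably() and write_all() are hypothetical helpers, the ".tmp" suffix is an arbitrary choice, and only standard POSIX calls (open, write, fsync, rename) are used.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write all bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Replace dir/name so that a crash leaves either the complete old file
 * or the complete new one. Returns 0 on success, -1 with errno set.
 */
static int replace_file_durably(const char *dir, const char *name,
				const char *buf, size_t len)
{
	char path[4096], tmp[4096];
	int fd, dirfd, ret;

	if (snprintf(path, sizeof(path), "%s/%s", dir, name) >= (int)sizeof(path) ||
	    snprintf(tmp, sizeof(tmp), "%s/%s.tmp", dir, name) >= (int)sizeof(tmp)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* Flush the new data before the rename can replace the old copy. */
	if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) < 0 || rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}

	/* Flush the directory so the rename itself reaches stable storage. */
	dirfd = open(dir, O_RDONLY);
	if (dirfd < 0)
		return -1;
	ret = fsync(dirfd);
	close(dirfd);
	return ret;
}
```

The fsync() before rename() is the step the thread is arguing about: without it, the new name may reach the disk before the new data does.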

-chris



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30  3:55                                                                                     ` Trenton D. Adams
@ 2009-03-30 13:45                                                                                       ` Theodore Tso
  0 siblings, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-30 13:45 UTC (permalink / raw)
  To: Trenton D. Adams
  Cc: Mark Lord, Stefan Richter, Jeff Garzik, Linus Torvalds,
	Matthew Garrett, Alan Cox, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Sun, Mar 29, 2009 at 09:55:59PM -0600, Trenton D. Adams wrote:
> > (This is with a filesystem formatted as ext3, and mounted as either
> > ext3 or ext4; if the filesystem is formatted using "mke2fs -t ext4",
> > what you see is a very smooth 1.2-1.5 seconds fsync latency, indirect
> > blocks for very big files end up being quite inefficient.)
> 
> Oh.  I thought I had read somewhere that mounting ext4 over ext3 would
> solve the problem.  Not sure where I read that now.  Sorry for wasting
> your time.

Well, I believe it should solve it for most realistic workloads (where
I don't think "dd if=/dev/zero of=bigzero.img" is realistic).  

Looking more closely at the statistics, the delays aren't coming from
trying to flush the data blocks in data=ordered mode.  If we disable
delayed allocation (mount -o nodelalloc), you'll see this when you
look at /proc/fs/jbd2/<dev>/history:

R/C  tid   wait  run   lock  flush log   hndls  block inlog ctime write drop  close
R    12    23    3836  0     1460  2563  50129  56    57   
R    13    0     5023  0     1056  2100  64436  70    71   
R    14    0     3156  0     1433  1803  40816  47    48   
R    15    0     4250  0     1206  2473  57623  63    64   
R    16    0     5000  0     1516  1136  61087  67    68   

Note the amount of time in milliseconds in the flush column.  That's
time spent flushing the allocated data blocks to disk.  This goes away
once you enable delayed allocation:

R/C  tid   wait  run   lock  flush log   hndls  block inlog ctime write drop  close
R    56    0     2283  0     10    1250  32735  37    38   
R    57    0     2463  0     13    1126  31297  38    39   
R    58    0     2413  0     13    1243  35340  40    41   
R    59    3     2383  0     20    1270  30760  38    39   
R    60    0     2316  0     23    1176  33696  38    39   
R    61    0     2266  0     23    1150  29888  37    38   
R    62    0     2490  0     26    1140  35661  39    40   

You may see slightly worse times since I'm running with a patch (which
will be pushed for 2.6.30) that makes sure that the blocks we are
writing during the "log" phase are written using WRITE_SYNC instead of
WRITE.  (Without this patch, the huge amount of writes caused by the
VM trying to keep up with pages being dirtied at CPU speeds via "dd
if=/dev/zero..." will interfere with writes to the journal.)

During the log phase (which is averaging around 2 seconds for
nodelalloc, and 1 second with delayed allocation enabled), we write
the metadata to the journal.  The number of blocks that we are
actually writing to the journal is small (around 40 per transaction)
so I suspect we're seeing some lock contention or some accounting
overhead caused by the metadata blocks constantly getting dirtied by
the dd if=/dev/zero task.  We can look to see if this can be improved,
possibly by changing how we handle the locking, but it's no longer
being caused by the data=ordered flushing behaviour.
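
The averages quoted above can be checked from the two history tables. The sketch below is plain arithmetic over the "flush" and "log" columns copied from those tables; mean_ms() and the array names are made up for the illustration.

```c
#include <stddef.h>

/* Arithmetic mean of n millisecond samples (0.0 for an empty set). */
static double mean_ms(const unsigned int *v, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return n ? (double)sum / (double)n : 0.0;
}

/* Columns taken from the nodelalloc table (tids 12-16). */
static const unsigned int nodelalloc_flush[] = { 1460, 1056, 1433, 1206, 1516 };
static const unsigned int nodelalloc_log[]   = { 2563, 2100, 1803, 2473, 1136 };

/* Columns taken from the delayed-allocation table (tids 56-62). */
static const unsigned int delalloc_flush[] = { 10, 13, 13, 20, 23, 23, 26 };
static const unsigned int delalloc_log[]   = { 1250, 1126, 1243, 1270, 1176, 1150, 1140 };
```

Feeding these arrays to mean_ms() gives a flush time of about 1334 ms without delayed allocation against about 18 ms with it, and a log time of 2015 ms against about 1194 ms.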

> Yes, I realize that.  When trying to find performance problems I try
> to be as *unfair* as possible. :D

And that's a good thing from a development point of view when trying
to fix performance problems.  When making statements about what people
are likely to find in real life, it's less useful.

    	      	      	   	      	   - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 11:17                                                                                       ` Ric Wheeler
@ 2009-03-30 13:48                                                                                         ` Mark Lord
  2009-03-30 14:00                                                                                           ` Ric Wheeler
  2009-03-30 15:34                                                                                         ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-30 13:48 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Ric Wheeler wrote:
>
> People keep forgetting that storage (even on your commodity s-ata class 
> of drives) has very large & volatile cache. The disk firmware can hold 
> writes in that cache as long as it wants, reorder its writes into 
> anything that makes sense and has no explicit ordering promises.
..

Hi Ric,

No, we don't forget about those drive caches.  But in practice,
for nearly everyone, they don't actually matter.

The kernel can crash, and the drives, in practice, will still
flush their caches to media by themselves.  Within a second or two.

Sure, there are cases where this might not happen (total power fail),
but those are quite rare for desktop users -- and especially for the
most common variety of desktop user:  notebook users (whose machines
have built-in UPSs).

Cheers

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30  6:31                                                                                 ` Dave Chinner
@ 2009-03-30 13:55                                                                                   ` Theodore Tso
  0 siblings, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-30 13:55 UTC (permalink / raw)
  To: Mark Lord, Stefan Richter, Jeff Garzik, Linus Torvalds,
	Matthew Garrett, Alan Cox, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Mon, Mar 30, 2009 at 05:31:10PM +1100, Dave Chinner wrote:
> 
> Pardon my french, but that is a fucking joke.
> 
> You are making a judgement call that one application is more
> important than another application and trying to impose that on
> everyone. You are saying that we should perturb a well designed and
> written backup application that is embedded into critical scripts
> all around the world for the sake of desktop application that has
> developers that are too fucking lazy to fix their bugs.

You are welcome to argue with the desktop application writers (and
Linus, who has sided with them).  I *knew* this was a fight I was not
going to win, so I implemented the replace-via-rename workaround, even
before I started trying to convince application writers that they
should write more portable code that would be safe on filesystems such
as, say, XFS.  And it looks like we're losing that battle as well;
it's hard to get people to write correct, portable code!  (I *told*
the application writers that I was the moderate on this one, even as
they were flaming me to a crisp.  Given that I'm taking flak from both
sides, it's to me a good indication that the design choices made for
ext4 were probably the right thing.)

> If you want to trade rsync performance for desktop performance, do
> it in the filesystem that is aimed at the desktop. Don't fuck rename
> up for filesystems that are aimed at the server market and don't
> want to implement performance sucking hacks to work around fucked up
> desktop applications.

What I did was create a mount option for system administrators
interested in the server market.  And an rsync option that unlinks the
target filesystem first really isn't that big of a deal --- have you
seen how many options rsync already has?  It's been a running joke
with the rsync developers.  :-)

If XFS doesn't want to try to support the desktop market, that's fine
--- it's your choice.  But at least as far as desktop application
programmers, this is not a fight we're going to win.  It makes me sad,
but I'm enough of a realist to understand that.

	     	     	   	       	     - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 13:48                                                                                         ` Mark Lord
@ 2009-03-30 14:00                                                                                           ` Ric Wheeler
  2009-03-30 14:44                                                                                             ` Mark Lord
  0 siblings, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-30 14:00 UTC (permalink / raw)
  To: Mark Lord
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Mark Lord wrote:
> Ric Wheeler wrote:
>>
>> People keep forgetting that storage (even on your commodity s-ata 
>> class of drives) has very large & volatile cache. The disk firmware 
>> can hold writes in that cache as long as it wants, reorder its writes 
>> into anything that makes sense and has no explicit ordering promises.
> ..
> 
> Hi Ric,
> 
> No, we don't forget about those drive caches.  But in practice,
> for nearly everyone, they don't actually matter.

Here I disagree - nearly everyone has their critical data being manipulated in 
large data centers on top of Linux servers. We all can routinely suffer when 
Linux crashes and loses data at big sites like Google, Amazon, hospitals or your 
local bank.

It definitely does matter in practice, we usually just don't see it first hand :-)


> 
> The kernel can crash, and the drives, in practice, will still
> flush their caches to media by themselves.  Within a second or two.

Even with desktops, I am not positive that the drive write cache survives a 
kernel crash without data loss. If I remember correctly, Chris's tests used 
crashes (not power outages) to display the data corruption that happened without 
barriers being enabled properly.

> 
> Sure, there are cases where this might not happen (total power fail),
> but those are quite rare for desktop users -- and especially for the
> most common variety of desktop user:  notebook users (whose machines
> have built-in UPSs).
> 
> Cheers

Unless of course you push your luck with your battery and run it until really 
out of power, but in general, I do agree that laptops and notebook users have a 
reasonably robust built in UPS.

ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-30 12:11                                                   ` [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync() Fernando Luis Vázquez Cao
@ 2009-03-30 14:04                                                     ` Theodore Tso
  2009-03-30 14:15                                                       ` Chris Mason
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-30 14:04 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

On Mon, Mar 30, 2009 at 09:11:58PM +0900, Fernando Luis Vázquez Cao wrote:
> To ensure that bits are truly on-disk after an fsync or fdatasync, we
> should force a disk flush explicitly when there is dirty data/metadata
> and the journal didn't emit a write barrier (either because metadata is
> not being synched or barriers are disabled).

NACK.

As Eric commented on linux-ext4 (and I think it is Chris Mason who
deserves the credit for originally pointing this out), we don't need
to call blkdev_issue_flush() after calling sync_inode().  That's
because sync_inode() eventually (after going through a very deep call
tree inside fs/fs-writeback.c) calls __sync_single_inode(), which calls
write_inode(), which calls the filesystem-specific ->write_inode()
function, which for both ext3 and ext4, ends up calling
ext[34]_force_commit.  Which, if barriers are enabled, will end up
issuing a barrier after writing the commit block.

In the code paths that don't end up calling sync_inode() or
ext4_force_commit(), (i.e., in the fdatasync() case) calling
block_flush_device is appropriate.  But as it stands, this patch (and
the related one for ext4) will result in multiple unnecessary barrier
requests being sent to the block layer.

So two out of the three places where this patch adds
block_flush_device() are not necessary; as far as I can tell, only
this one is one we should add.

> -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> -		goto out;
> +	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
> +		if (i_state & I_DIRTY_PAGES)
> +			ret = block_flush_device(inode->i_sb->s_bdev);
> +		return ret;
> +	}

A similar fixup is needed for the ext4 patch.

(And can we please start a new thread for these patches?  Thanks!!)

Regards,

     	    	   	       	      	  	- Ted



* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-30 14:04                                                     ` Theodore Tso
@ 2009-03-30 14:15                                                       ` Chris Mason
  2009-03-30 14:33                                                         ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Chris Mason @ 2009-03-30 14:15 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Linus Torvalds, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, david, tj

On Mon, 2009-03-30 at 10:04 -0400, Theodore Tso wrote:
> On Mon, Mar 30, 2009 at 09:11:58PM +0900, Fernando Luis Vázquez Cao wrote:
> > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > should force a disk flush explicitly when there is dirty data/metadata
> > and the journal didn't emit a write barrier (either because metadata is
> > not being synched or barriers are disabled).
> 
> NACK.
> 
> As Eric commented on linux-ext4 (and I think it is Chris Mason who
> deserves the credit for originally pointing this out), we don't need
> to call blkdev_issue_flush() after calling sync_inode().  That's
> because sync_inode() eventually (after going through a very deep call
> tree inside fs/fs-writeback.c) calls __sync_single_inode(), which calls
> write_inode(), which calls the filesystem-specific ->write_inode()
> function, which for both ext3 and ext4 ends up calling
> ext[34]_force_commit().  That, if barriers are enabled, will end up
> issuing a barrier after writing the commit block.
> 
> In the code paths that don't end up calling sync_inode() or
> ext4_force_commit(), (i.e., in the fdatasync() case) calling
> block_flush_device is appropriate.  But as it stands, this patch (and
> the related one for ext4) will result in multiple unnecessary barrier
> requests being sent to the block layer.
> 

I'm not sure we want to stick Fernando with changing how barriers are
done in individual filesystems; his patch is just changing the existing
call points.

The ext34 code is especially tricky because there's no way to tell if a
commit was actually done by sync_inode, so there's no way to know if an
extra flush is really required.

I think we'll be better off if he keeps the existing logic and a
different patch set is made for tuning the ext3 and ext4 code.

-chris




* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 12:36                                                     ` Jens Axboe
@ 2009-03-30 14:18                                                       ` Fernando Luis Vázquez Cao
  2009-03-30 14:35                                                         ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-30 14:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

Jens Axboe wrote:
> On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
>> Add a sysfs knob to disable storage device writeback cache flushes.
>>
>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>> ---
>>
>> diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
>> --- linux-2.6.29-orig/block/blk-barrier.c	2009-03-24 08:12:14.000000000 +0900
>> +++ linux-2.6.29/block/blk-barrier.c	2009-03-30 17:08:28.000000000 +0900
>> @@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
>>  	if (!q)
>>  		return -ENXIO;
>>
>> +	if (blk_queue_nowbcflush(q))
>> +		return -EOPNOTSUPP;
>> +
>>  	bio = bio_alloc(GFP_KERNEL, 0);
>>  	if (!bio)
>>  		return -ENOMEM;
>> diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
>> --- linux-2.6.29-orig/block/blk-core.c	2009-03-24 08:12:14.000000000 +0900
>> +++ linux-2.6.29/block/blk-core.c	2009-03-30 17:08:28.000000000 +0900
>> @@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
>>  			goto end_io;
>>  		}
>>  		if (bio_barrier(bio) && bio_has_data(bio) &&
>> -		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
>> +		    (blk_queue_nowbcflush(q) ||
>> +		     q->next_ordered == QUEUE_ORDERED_NONE)) {
>>  			err = -EOPNOTSUPP;
>>  			goto end_io;
>>  		}
> 
> This (and the above hunk) should be changed. -EOPNOTSUPP means the
> target does not support barriers, that is a different thing to flushes
> not being needed. A file system issuing a barrier and getting
> -EOPNOTSUPP back will disable barriers, since it now thinks that
> ordering cannot be guaranteed.

The reason I decided to use -EOPNOTSUPP was that I wanted to keep
barriers and device flushes from entering the block layer when
they are not needed. I feared that if we pass them down the block
stack (knowing in advance they will not be actually submitted to
disk) we may end up slowing things down unnecessarily.

As you mentioned, filesystems such as ext3/4 will disable
barriers if they get -EOPNOTSUPP when issuing one. I was planning
to add a notifier mechanism so that we can notify filesystems that
there has been a change in the barrier settings. This might be
over-engineering, though, especially considering that "-o
remount,barrier=1" will bring the barriers back.

> A more appropriate change would be to successfully complete a flush
> without actually sending it down to the device if blk_queue_nowbcflush()
> is true. Then blkdev_issue_flush() would just work as well. It also
> needs to take stacking into account, or stacked drivers will have to
> propagate the settings up the stack. If you allow simply the barrier to
> be passed down, you get that for free.

Aren't we risking slowing things down? Does the small optimization above
make sense (especially taking the remount trick into account)?

Maybe I am worrying too much about the possible performance penalty.

>> +static struct queue_sysfs_entry queue_wbcflush_entry = {
>> +	.attr = {.name = "wbcflush", .mode = S_IRUGO | S_IWUSR },
>> +	.show = queue_wbcflush_show,
>> +	.store = queue_wbcflush_store,
>> +};
>> +
> 
> Naming is also pretty bad, perhaps something like "honor_cache_flush"
> would be better, or perhaps "cache_flush_needed". At least something
> that is more descriptive of what this setting actually controls; wbcflush
> does not do that.

You are right, wbcflush is a pretty ugly name. I will use
"honor_cache_flush" in the next iteration of the patches.


Thanks,

Fernando


* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-30 14:15                                                       ` Chris Mason
@ 2009-03-30 14:33                                                         ` Theodore Tso
  2009-03-31  1:26                                                           ` Tejun Heo
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-30 14:33 UTC (permalink / raw)
  To: Chris Mason
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Linus Torvalds, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, david, tj

On Mon, Mar 30, 2009 at 10:15:51AM -0400, Chris Mason wrote:
> 
> I'm not sure we want to stick Fernando with changing how barriers are
> done in individual filesystems, his patch is just changing the existing
> call points.

Well, his patch actually added some calls to block_flush_device().  But
yes, it's probably better if he just changes the existing call points,
and we can have the relevant filesystem maintainers double check to
make sure that there aren't any new call points which are needed.

> The ext34 code is especially tricky because there's no way to tell if a
> commit was actually done by sync_inode, so there's no way to know if an
> extra flush is really required.

Yes, good point.  What we need to do is to save inode->i_state
*before* the call to sync_inode(), and issue the flush if the original
value of (inode->i_state & I_DIRTY) == I_DIRTY_PAGES.  But yeah,
that's tricky.
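
The check described above can be sketched in userspace C. This is only an
illustration under stated assumptions: the flag values are placeholders for
this sketch and should not be relied on to match any kernel header, and
flush_needed_after_sync() is a hypothetical helper, not an existing kernel
function. The idea is to sample i_state before sync_inode() runs, and to
treat "only pages were dirty" as the case where no journal commit (and so
no barrier) would have been issued on the inode's behalf.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag values for this sketch; do not assume they match
 * the definitions in any particular kernel's include/linux/fs.h. */
#define I_DIRTY_SYNC      (1u << 0)
#define I_DIRTY_DATASYNC  (1u << 1)
#define I_DIRTY_PAGES     (1u << 2)
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

/* Hypothetical helper: takes the inode state sampled *before*
 * sync_inode() and reports whether an explicit device flush would
 * still be wanted afterwards.  True only when data pages were dirty
 * and no inode metadata was. */
static bool flush_needed_after_sync(unsigned int i_state_before)
{
    return (i_state_before & I_DIRTY) == I_DIRTY_PAGES;
}

/* Small truth table for the helper above. */
static void flush_needed_examples(void)
{
    /* Only data pages dirty: nothing forces a commit, so flush. */
    assert(flush_needed_after_sync(I_DIRTY_PAGES));
    /* Metadata dirty too: the commit path is expected to issue the barrier. */
    assert(!flush_needed_after_sync(I_DIRTY_PAGES | I_DIRTY_SYNC));
    /* Nothing dirty: nothing to flush. */
    assert(!flush_needed_after_sync(0));
}
```

The sampling has to happen before sync_inode() because the dirty bits are
cleared as the inode is written back; checking afterwards would lose the
information.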

> I think we'll be better off if he keeps the existing logic and a
> different patch set is made for tuning the ext3 and ext4 code.

Agreed.

						- Ted


* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 14:18                                                       ` Fernando Luis Vázquez Cao
@ 2009-03-30 14:35                                                         ` Jens Axboe
  2009-03-31  6:49                                                           ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-30 14:35 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
> Jens Axboe wrote:
>> On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
>>> Add a sysfs knob to disable storage device writeback cache flushes.
>>>
>>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>> ---
>>>
>>> diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
>>> --- linux-2.6.29-orig/block/blk-barrier.c	2009-03-24 08:12:14.000000000 +0900
>>> +++ linux-2.6.29/block/blk-barrier.c	2009-03-30 17:08:28.000000000 +0900
>>> @@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
>>>  	if (!q)
>>>  		return -ENXIO;
>>>
>>> +	if (blk_queue_nowbcflush(q))
>>> +		return -EOPNOTSUPP;
>>> +
>>>  	bio = bio_alloc(GFP_KERNEL, 0);
>>>  	if (!bio)
>>>  		return -ENOMEM;
>>> diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
>>> --- linux-2.6.29-orig/block/blk-core.c	2009-03-24 08:12:14.000000000 +0900
>>> +++ linux-2.6.29/block/blk-core.c	2009-03-30 17:08:28.000000000 +0900
>>> @@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
>>>  			goto end_io;
>>>  		}
>>>  		if (bio_barrier(bio) && bio_has_data(bio) &&
>>> -		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
>>> +		    (blk_queue_nowbcflush(q) ||
>>> +		     q->next_ordered == QUEUE_ORDERED_NONE)) {
>>>  			err = -EOPNOTSUPP;
>>>  			goto end_io;
>>>  		}
>>
>> This (and the above hunk) should be changed. -EOPNOTSUPP means the
>> target does not support barriers, that is a different thing to flushes
>> not being needed. A file system issuing a barrier and getting
>> -EOPNOTSUPP back will disable barriers, since it now thinks that
>> ordering cannot be guaranteed.
>
> The reason I decided to use -EOPNOTSUPP was that I wanted to keep
> barriers and device flushes from entering the block layer when
> they are not needed. I feared that if we pass them down the block
> stack (knowing in advance they will not be actually submitted to
> disk) we may end up slowing things down unnecessarily.

But that's just wrong: you need to make sure that the block layer / io
scheduler doesn't reorder either. It's a lot more complex than just the
device end. So just returning -EOPNOTSUPP and pretending that you need
not use barriers at the fs end is just wrong.

> As you mentioned, filesystems such as ext3/4 will disable
> barriers if they get -EOPNOTSUPP when issuing one. I was planning
> to add a notifier mechanism so that we can notify filesystems that
> there has been a change in the barrier settings. This might be
> over-engineering, though, especially considering that "-o
> remount,barrier=1" will bring the barriers back.

I think that is over-engineering.

>> A more appropriate change would be to successfully complete a flush
>> without actually sending it down to the device if blk_queue_nowbcflush()
>> is true. Then blkdev_issue_flush() would just work as well. It also
>> needs to take stacking into account, or stacked drivers will have to
>> propagate the settings up the stack. If you allow simply the barrier to
>> be passed down, you get that for free.
>
> Aren't we risking slowing things down? Does the small optimization above
> make sense (especially taking the remount trick into account)?

It's not, I think you are missing the bigger picture.
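
The alternative quoted above (complete the flush successfully without
sending it to the device) can be modeled with a toy userspace sketch. All
names here (struct model_queue, struct model_bdev, model_issue_flush()) are
hypothetical stand-ins, not kernel API, and the sketch covers only the
device end; it says nothing about the ordering that the block layer and io
scheduler still have to enforce.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a request queue. */
struct model_queue {
    bool flush_not_needed;     /* models blk_queue_nowbcflush(q) being true */
    bool flush_supported;      /* device implements a cache flush command */
    unsigned int flushes_sent; /* flush commands that reached the device */
};

/* Toy stand-in for a block device. */
struct model_bdev {
    struct model_queue *queue; /* NULL when no queue is attached */
};

/* Hypothetical flush entry point.  When the administrator has declared
 * flushes unnecessary, report success instead of -EOPNOTSUPP, so the
 * caller does not conclude that barriers are unsupported. */
static int model_issue_flush(struct model_bdev *bdev)
{
    assert(bdev != NULL);

    struct model_queue *q = bdev->queue;
    if (q == NULL)
        return -ENXIO;
    if (q->flush_not_needed)
        return 0;               /* completed without touching the device */
    if (!q->flush_supported)
        return -EOPNOTSUPP;     /* genuinely unsupported */

    q->flushes_sent++;          /* stand-in for submitting the flush */
    return 0;
}
```

With this shape, -EOPNOTSUPP keeps its single meaning ("the target cannot
do this"), and a filesystem that sees 0 has no reason to switch its
barriers off.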

-- 
Jens Axboe



* Re: [PATCH 2/2] Make relatime default
  2009-03-26 18:48                                                 ` Alan Cox
  2009-03-26 22:27                                                   ` Linus Torvalds
@ 2009-03-30 14:42                                                   ` Andrea Arcangeli
  2009-03-30 14:52                                                     ` Xavier Bestel
  2009-03-30 19:26                                                     ` Bill Davidsen
  1 sibling, 2 replies; 664+ messages in thread
From: Andrea Arcangeli @ 2009-03-30 14:42 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matthew Garrett, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Jan Kara, Andrew Morton, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, Oleg Nesterov, Roland McGrath

On Thu, Mar 26, 2009 at 06:48:38PM +0000, Alan Cox wrote:
> On Thu, 26 Mar 2009 17:53:14 +0000
> Matthew Garrett <mjg@redhat.com> wrote:
> 
> > Change the default behaviour of the kernel to use relatime for all
> > filesystems. This can be overridden with the "strictatime" mount
> > option.
> 
> NAK this again

NAK, but for a different reason: if we change the default, it is better
to change it to the real thing: noatime.

I think this can be solved in userland, but perhaps changing this in
the kernel would send a stronger message that atime is officially
obsolete. (And nothing will break; not even mutt users will notice,
and if they really do, it won't be anything more than aesthetic.)


About the open(destination, O_TRUNC); write; close sequence, I think it's
not worth changing the kernel or the VM in any way to hide buggy
programming like that; on the contrary, it's great it was found early
on (instead of being filed as some obscure non-reproducible bug lost
in some bugzilla and hitting once in a while with an unlucky
power loss during boot). But solving this bug with fsync so it works
for writeback mode too would make me prefer to gamble and run the
buggy version ;). I am not sure if it is worth providing any ordering
guarantee beyond 'ordered' mode in the long term, or some proper
barrier, but at least ordered mode already allows renaming the
tempfile to be enough, and that is clearly the best tradeoff. fsync
really should be used only to avoid total loss of information (like
when we need to avoid losing the delivery of an email after the SMTP
client is told the email was already received by the SMTP server).

Using fsync to tell the kernel in what order to write its dirty
pagecache data to disk is as inefficient as driving a car to travel a
distance of 10 meters, so people rightfully aren't using it for this, even
if it's the only way it could work for writeback and ext2 too.
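
For reference, the tempfile-and-rename pattern mentioned above might look
roughly like the following. This is a minimal sketch, not code from any
particular application: replace_file_atomically() and its parameters are
names made up here, and the want_fsync switch exists only to show where the
optional fsync() would sit (relying on ordered mode without it, or paying
for it to cover writeback mode and ext2).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `text` to tmp_path, optionally fsync it,
 * then rename it over `path`.  Returns 0 on success, -1 on failure.
 * The old file stays in place until rename() succeeds, unlike the
 * open(O_TRUNC)/write/close sequence, which destroys it first. */
static int replace_file_atomically(const char *path, const char *tmp_path,
                                   const char *text, int want_fsync)
{
    assert(path != NULL && tmp_path != NULL && text != NULL);

    size_t len = strlen(text);
    size_t off = 0;

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until done. */
    while (off < len) {
        ssize_t n = write(fd, text + off, len - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        off += (size_t)n;
    }

    /* Optional: force the new data to stable storage before the rename. */
    if (want_fsync && fsync(fd) != 0)
        goto fail;

    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Swap the new file into place under the final name. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}
```

The temporary file should live on the same filesystem as the destination,
since rename() does not work across filesystems.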


* Re: Linux 2.6.29
  2009-03-30 14:00                                                                                           ` Ric Wheeler
@ 2009-03-30 14:44                                                                                             ` Mark Lord
  2009-03-30 14:58                                                                                               ` Ric Wheeler
  2009-03-30 15:00                                                                                               ` Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-30 14:44 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Ric Wheeler wrote:
> Mark Lord wrote:
>> Ric Wheeler wrote:
..
>> The kernel can crash, and the drives, in practice, will still
>> flush their caches to media by themselves.  Within a second or two.
> 
> Even with desktops, I am not positive that the drive write cache 
> survives a kernel crash without data loss. If I remember correctly, 
> Chris's tests used crashes (not power outages) to display the data 
> corruption that happened without barriers being enabled properly.
..

Linux f/s barriers != drive write caches.

Drive write caches are an almost total non-issue for desktop users,
except on the (very rare) event of a total, sudden power failure
during extended write outs.

Very rare.  Yes, a huge problem for server farms.  No question.
But the majority of Linux systems are probably (still) desktops/notebooks.

Cheers


* Re: [PATCH 2/2] Make relatime default
  2009-03-30 14:42                                                   ` Andrea Arcangeli
@ 2009-03-30 14:52                                                     ` Xavier Bestel
  2009-03-30 19:26                                                     ` Bill Davidsen
  1 sibling, 0 replies; 664+ messages in thread
From: Xavier Bestel @ 2009-03-30 14:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alan Cox, Matthew Garrett, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Jan Kara, Andrew Morton, Arjan van de Ven,
	Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
	Roland McGrath

On Mon, 2009-03-30 at 16:42 +0200, Andrea Arcangeli wrote:
> On Thu, Mar 26, 2009 at 06:48:38PM +0000, Alan Cox wrote:
> > On Thu, 26 Mar 2009 17:53:14 +0000
> > Matthew Garrett <mjg@redhat.com> wrote:
> > 
> > > Change the default behaviour of the kernel to use relatime for all
> > > filesystems. This can be overridden with the "strictatime" mount
> > > option.
> > 
> > NAK this again
> 
> NAK but because if we change the default it is better to change it to
> the real thing: noatime.
> 
> I think this can be solved in userland but perhaps changing this in
> kernel would be a stronger message that atime is officially
> obsoleted. (and nothing will break, not even mutt users will notice,
> and if they really do it won't be anything more than aesthetical)

Actually I liked the previous relatime (without the 24h hack) pretty
much. It would have preserved the atime functionality without making
something like slocate dirty huge parts of the fs daily.

I vote for relatime-without-24h-hack !

	Xav




* Re: Linux 2.6.29
  2009-03-30 14:44                                                                                             ` Mark Lord
@ 2009-03-30 14:58                                                                                               ` Ric Wheeler
  2009-03-30 15:21                                                                                                 ` Mark Lord
  2009-03-30 15:00                                                                                               ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-30 14:58 UTC (permalink / raw)
  To: Mark Lord
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Mark Lord wrote:
> Ric Wheeler wrote:
>> Mark Lord wrote:
>>> Ric Wheeler wrote:
> ..
>>> The kernel can crash, and the drives, in practice, will still
>>> flush their caches to media by themselves.  Within a second or two.
>>
>> Even with desktops, I am not positive that the drive write cache 
>> survives a kernel crash without data loss. If I remember correctly, 
>> Chris's tests used crashes (not power outages) to display the data 
>> corruption that happened without barriers being enabled properly.
> ..
> 
> Linux f/s barriers != drive write caches.
> 
> Drive write caches are an almost total non-issue for desktop users,
> except on the (very rare) event of a total, sudden power failure
> during extended write outs.
> 
> Very rare.  Yes, a huge problem for server farms.  No question.
> But the majority of Linux systems are probably (still) desktops/notebooks.
> 
> Cheers

I am confused as to why you think that barriers (flush barriers specifically)
are not equivalent to drive write cache. We disable barriers when the write
cache is off, and use them only to ensure that our ordering for fs transactions
survives any power loss. No one should be enabling barriers on Linux file
systems if the write cache is disabled or if there is a battery-backed write
cache (say on an enterprise-class disk array).

Chris' test of barriers (with write cache enabled) did show for desktop class 
boxes that you would get file system corruption (i.e., need to fsck the disk) a 
huge percentage of the time.

Sudden power failures are not rare for desktops in my personal experience; I see
them several times a year in New England, both at home (ice, tree limbs, etc.) and
at work (unplanned outages for repair, broken AC, etc.).

Ric



* Re: Linux 2.6.29
  2009-03-30 14:44                                                                                             ` Mark Lord
  2009-03-30 14:58                                                                                               ` Ric Wheeler
@ 2009-03-30 15:00                                                                                               ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 15:00 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ric Wheeler, Andreas T.Auer, Alan Cox, Theodore Tso,
	Stefan Richter, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Mark Lord wrote:
> Ric Wheeler wrote:
>> Mark Lord wrote:
>>> Ric Wheeler wrote:
> ..
>>> The kernel can crash, and the drives, in practice, will still
>>> flush their caches to media by themselves.  Within a second or two.
>>
>> Even with desktops, I am not positive that the drive write cache 
>> survives a kernel crash without data loss. If I remember correctly, 
>> Chris's tests used crashes (not power outages) to display the data 
>> corruption that happened without barriers being enabled properly.
> ..
> 
> Linux f/s barriers != drive write caches.
> 
> Drive write caches are an almost total non-issue for desktop users,
> except on the (very rare) event of a total, sudden power failure
> during extended write outs.
> 
> Very rare.

Heck, even I have lost power on a plane, while a laptop in laptop mode 
was flushing out work.  Not that rare.


> Yes, a huge problem for server farms.  No question.
> But the majority of Linux systems are probably (still) desktops/notebooks.

But it doesn't really matter who is what majority, does it?  At the 
present time at least, we have not designated any filesystems "desktop 
only", nor have we declared Linux a desktop-only OS.

Any generalized decision that hurts servers to help desktops would be 
short-sighted.  Robbing Peter, to pay Paul, is no formula for OS success.

	Jeff





* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 12:09                                                   ` [PATCH 1/7] block: Add block_flush_device() Fernando Luis Vázquez Cao
@ 2009-03-30 15:07                                                     ` Bartlomiej Zolnierkiewicz
  2009-03-31  6:09                                                       ` Fernando Luis Vázquez Cao
  2009-03-30 17:34                                                     ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-03-30 15:07 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
> This patch adds a helper function that should be used by filesystems that need
> to flush the underlying block device on fsync()/fdatasync().
> 
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---
> 
> diff -urNp linux-2.6.29-orig/fs/buffer.c linux-2.6.29/fs/buffer.c
> --- linux-2.6.29-orig/fs/buffer.c	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/fs/buffer.c	2009-03-30 15:27:04.000000000 +0900
> @@ -165,6 +165,17 @@ void end_buffer_write_sync(struct buffer
>   	put_bh(bh);
>   }
> 
> +/* Issue flush of write caches on the block device */
> +int block_flush_device(struct block_device *bdev)

I don't consider this an improvement over using blkdev_issue_flush().

> +{
> +	int ret = 0;
> +
> +	ret = blkdev_issue_flush(bdev, NULL);

The problem lies in using NULL for the error_sector argument, which shows
a subtle deficiency of the current implementation/usage of barriers
based on write cache flushing.

I intend to document the issue by adding a FIXME to the current
users of blkdev_issue_flush() so the problem is at least known and not
forgotten (fixing it would require some work on both the block and fs
sides, and unfortunately there wasn't even a willingness to discuss
possible solutions a few years back when the original code was added).

Thanks,
Bart


* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 12:22                                                   ` [PATCH 5/7] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes Fernando Luis Vázquez Cao
  2009-03-30 12:36                                                     ` Jens Axboe
@ 2009-03-30 15:14                                                     ` Bartlomiej Zolnierkiewicz
  2009-03-30 17:51                                                       ` Jens Axboe
  1 sibling, 1 reply; 664+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-03-30 15:14 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
> Add a sysfs knob to disable storage device writeback cache flushes.

The horde of casual desktop users (with me included) would probably prefer
having two settings -- one for filesystem barriers and one for fsync().

IOW I prefer higher performance at the cost of risking losing the last few
seconds/minutes of work in case of a crash / power failure, but I would still
like to have the filesystem in a consistent state after such an accident.

Thanks,
Bart


* Re: [PATCH 6/7] xfs: propagate issue-flush error code
  2009-03-30 12:33                                                   ` [PATCH 6/7] xfs: propagate issue-flush error code Fernando Luis Vázquez Cao
@ 2009-03-30 15:20                                                     ` Bartlomiej Zolnierkiewicz
  2009-03-31 23:37                                                     ` Dave Chinner
  1 sibling, 0 replies; 664+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-03-30 15:20 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
> blkdev_issue_flush() may fail (e.g. due to a media error on FLUSH CACHE
> command execution) so its users should check the return value.
> 
> (This issue was first spotted by Bartlomiej Zolnierkiewicz)
> 
> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---
> 
> diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_buf.c linux-2.6.29/fs/xfs/linux-2.6/xfs_buf.c
> --- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_buf.c	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/fs/xfs/linux-2.6/xfs_buf.c	2009-03-30 14:44:34.000000000 +0900
> @@ -1446,6 +1446,7 @@ xfs_free_buftarg(
>   {
>   	xfs_flush_buftarg(btp, 1);
>   	if (mp->m_flags & XFS_MOUNT_BARRIER)
> +		/* FIXME: check return value */
>   		xfs_blkdev_issue_flush(btp);
>   	xfs_free_bufhash(btp);
>   	iput(btp->bt_mapping->host);
> diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.c linux-2.6.29/fs/xfs/linux-2.6/xfs_super.c
> --- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.c	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/fs/xfs/linux-2.6/xfs_super.c	2009-03-30 15:16:42.000000000 +0900
> @@ -721,11 +721,11 @@ xfs_mountfs_check_barriers(xfs_mount_t *
>   	}
>   }
> 
> -void
> +int
>   xfs_blkdev_issue_flush(
>   	xfs_buftarg_t		*buftarg)
>   {
> -	blkdev_issue_flush(buftarg->bt_bdev, NULL);
> +	return block_flush_device(buftarg->bt_bdev);
>   }
> 
>   STATIC void
> diff -urNp linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.h linux-2.6.29/fs/xfs/linux-2.6/xfs_super.h
> --- linux-2.6.29-orig/fs/xfs/linux-2.6/xfs_super.h	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/fs/xfs/linux-2.6/xfs_super.h	2009-03-30 14:46:31.000000000 +0900
> @@ -89,7 +89,7 @@ struct block_device;
> 
>   extern __uint64_t xfs_max_file_offset(unsigned int);
> 
> -extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
> +extern int xfs_blkdev_issue_flush(struct xfs_buftarg *);
> 
>   extern const struct export_operations xfs_export_operations;
>   extern struct xattr_handler *xfs_xattr_handlers[];
> diff -urNp linux-2.6.29-orig/fs/xfs/xfs_vnodeops.c linux-2.6.29/fs/xfs/xfs_vnodeops.c
> --- linux-2.6.29-orig/fs/xfs/xfs_vnodeops.c	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/fs/xfs/xfs_vnodeops.c	2009-03-30 15:08:21.000000000 +0900
> @@ -678,20 +678,20 @@ xfs_fsync(
>   		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   	}
> 
> -	if ((ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
> +	if (!error && (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
>   		/*
>   		 * If the log write didn't issue an ordered tag we need
>   		 * to flush the disk cache for the data device now.
>   		 */
>   		if (!log_flushed)
> -			xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);
> +			error = xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);

This is different from my original patch which preserved the original
error value...

>   		/*
>   		 * If this inode is on the RT dev we need to flush that
>   		 * cache as well.
>   		 */
> -		if (XFS_IS_REALTIME_INODE(ip))
> -			xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);
> +		if (!error && XFS_IS_REALTIME_INODE(ip))
> +			error = xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);

This is also different and is a change in behavior
(it makes sense IMHO but please document it).

Thanks,
Bart
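
To make the behavior change under review concrete, here is a minimal
user-space sketch of the two policies, not kernel code. `flush_device()` and
`struct fake_dev` are hypothetical stand-ins for xfs_blkdev_issue_flush() and
the buftarg. Variant B is only one plausible reading of "preserved the
original error value"; it is an assumption, not the text of the original patch.

```c
#include <errno.h>

/* Hypothetical stand-in for xfs_blkdev_issue_flush(): counts calls and
 * returns whatever result was preset. Not a kernel API. */
struct fake_dev {
	int flush_result;	/* 0 or a negative errno */
	int flush_calls;
};

static int flush_device(struct fake_dev *dev)
{
	dev->flush_calls++;
	return dev->flush_result;
}

/* Variant A (as in the quoted hunk): skip the flush when an earlier
 * step already failed, and return the first error seen. */
static int fsync_skip_on_error(int error, struct fake_dev *data,
			       struct fake_dev *rt)
{
	if (!error)
		error = flush_device(data);
	if (!error && rt)
		error = flush_device(rt);
	return error;
}

/* Variant B (assumed alternative): always flush, but never overwrite
 * an earlier error with a later result. */
static int fsync_always_flush(int error, struct fake_dev *data,
			      struct fake_dev *rt)
{
	int err2;

	err2 = flush_device(data);
	if (!error)
		error = err2;
	if (rt) {
		err2 = flush_device(rt);
		if (!error)
			error = err2;
	}
	return error;
}
```

Both variants return the first error, so callers see the same value; the
difference is only whether the device caches still get flushed after an
earlier failure, which is the behavior change that should be documented.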


* Re: Linux 2.6.29
  2009-03-30 14:58                                                                                               ` Ric Wheeler
@ 2009-03-30 15:21                                                                                                 ` Mark Lord
  2009-03-30 15:27                                                                                                   ` Ric Wheeler
  0 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-30 15:21 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Ric Wheeler wrote:
>
> I am confused as to why you think that barriers (flush barriers 
> specifically) are not equivalent to drive write cache. We disable 
> barriers when the write cache is off, use them only to ensure that our 
> ordering for fs transactions survives any power loss. No one should be 
> enabling barriers on linux file systems if your write cache is disabled 
> or if you have a battery backed write cache (say on an enterprise class 
> disk array).
> 
> Chris' test of barriers (with write cache enabled) did show for desktop 
> class boxes that you would get file system corruption (i.e., need to 
> fsck the disk) a huge percentage of the time.
..

Sure, no doubt there.  But it's due to the kernel crash,
not due to the write cache on the drive.

Anything in the drive's write cache very probably made it to the media
within a second or two of arriving there.

So with or without a write cache, the same result should happen
for those tests.  Of course, if you disable barriers *and* write cache,
then you are no longer testing the same kernel code.

I'm not arguing against battery backup or UPSs,
or *for* blindly trusting write caches without reliable power.

Just pointing out that they're not the evil that some folks
seem to believe they are.

Cheers


* Re: [PATCH 7/7] reiserfs: propagate issue-flush error code
  2009-03-30 12:36                                                   ` [PATCH 7/7] reiserfs: " Fernando Luis Vázquez Cao
@ 2009-03-30 15:25                                                     ` Bartlomiej Zolnierkiewicz
  0 siblings, 0 replies; 664+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-03-30 15:25 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
> blkdev_issue_flush() may fail (e.g. due to media error on FLUSH CACHE
> command execution) so its users should check for the return value.
> 
> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---
> 
> diff -urNp linux-2.6.29-orig/fs/reiserfs/file.c linux-2.6.29/fs/reiserfs/file.c
> --- linux-2.6.29-orig/fs/reiserfs/file.c	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/fs/reiserfs/file.c	2009-03-30 16:19:19.000000000 +0900
> @@ -146,8 +146,9 @@ static int reiserfs_sync_file(struct fil
>   	reiserfs_write_lock(p_s_inode->i_sb);
>   	barrier_done = reiserfs_commit_for_inode(p_s_inode);
>   	reiserfs_write_unlock(p_s_inode->i_sb);
> -	if (barrier_done != 1 && reiserfs_barrier_flush(p_s_inode->i_sb))
> -		blkdev_issue_flush(p_s_inode->i_sb->s_bdev, NULL);
> +	if (!n_err && barrier_done != 1 &&
> +			reiserfs_barrier_flush(p_s_inode->i_sb))
> +		n_err = block_flush_device(p_s_inode->i_sb->s_bdev);

This is again different from my original patch
(the change in behavior should be documented).

Thanks,
Bart


* Re: Linux 2.6.29
  2009-03-30 15:21                                                                                                 ` Mark Lord
@ 2009-03-30 15:27                                                                                                   ` Ric Wheeler
  2009-03-30 16:13                                                                                                     ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-30 15:27 UTC (permalink / raw)
  To: Mark Lord, Chris Mason
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Mark Lord wrote:
> Ric Wheeler wrote:
>>
>> I am confused as to why you think that barriers (flush barriers 
>> specifically) are not equivalent to drive write cache. We disable 
>> barriers when the write cache is off, use them only to ensure that our 
>> ordering for fs transactions survives any power loss. No one should be 
>> enabling barriers on linux file systems if your write cache is 
>> disabled or if you have a battery backed write cache (say on an 
>> enterprise class disk array).
>>
>> Chris' test of barriers (with write cache enabled) did show for 
>> desktop class boxes that you would get file system corruption (i.e., 
>> need to fsck the disk) a huge percentage of the time.
> ..
> 
> Sure, no doubt there.  But it's due to the kernel crash,
> not due to the write cache on the drive.
> 
> Anything in the drive's write cache very probably made it to the media
> within a second or two of arriving there.

A modern S-ATA drive has up to 32MB of write cache. If you lose power or suffer 
a sudden reboot (that can reset the bus at least), I am pretty sure that your 
above assumption is simply not true.

> 
> So with or without a write cache, the same result should happen
> for those tests.  Of course, if you disable barriers *and* write cache,
> then you are no longer testing the same kernel code.

Here, I still disagree. All of the tests that we have done have shown that write 
cache enabled/barriers off will provably result in fs corruption.

It would be great to have Chris revise his earlier barrier/corruption test to 
validate your assumption (not the test that he posted recently).

> 
> I'm not arguing against battery backup or UPSs,
> or *for* blindly trusting write caches without reliable power.
> 
> Just pointing out that they're not the evil that some folks
> seem to believe they are.
> 
> Cheers

I run with write cache and barriers enabled routinely, but would not run without 
working barriers on any desktop box when the drives have write cache enabled, 
having spent too many hours watching fsck churn :-)

ric



* Re: Linux 2.6.29
  2009-03-30 11:17                                                                                       ` Ric Wheeler
  2009-03-30 13:48                                                                                         ` Mark Lord
@ 2009-03-30 15:34                                                                                         ` Linus Torvalds
  2009-03-30 16:11                                                                                           ` Ric Wheeler
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 15:34 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Mon, 30 Mar 2009, Ric Wheeler wrote:
> 
> People keep forgetting that storage (even on your commodity s-ata class of
> drives) has very large & volatile cache. The disk firmware can hold writes in
> that cache as long as it wants, reorder its writes into anything that makes
> sense and has no explicit ordering promises.

Well, when it comes to disk caches, it really does make sense to start 
looking at what breaks.

For example, it is obviously true that any half-way modern disk has 
megabytes of caches, and write caching is quite often enabled by default. 

BUT!

The write-caches on disk are rather different in many very fundamental 
ways from the kernel write caches.

One of the differences is that no disk I've ever heard of does write- 
caching for long times, unless it has battery back-up. Yes, yes, you can 
probably find firmware that has some odd starvation issue, and if the disk 
is constantly busy and the access patterns are _just_ right the writes can 
take a long time, but realistically we're talking delaying and re-ordering 
things by milliseconds. We're not talking seconds or tens of seconds.

And that's really quite a _big_ difference in itself. It may not be 
qualitatively all that different (re-ordering is re-ordering, delays are 
delays), but IN PRACTICE there's an absolutely huge difference between 
delaying and re-ordering writes over milliseconds and doing so over 30s.

The other (huge) difference is that the on-disk write caching generally 
fails only if the drive power fails. Yes, there's a software component to 
it (buggy firmware), but you can really approximate the whole "disk write 
caches didn't get flushed" with "powerfail".

Kernel data caches? Let's be honest. The kernel can fail for a thousand 
different reasons, including very much _any_ component failing, rather 
than just the power supply. But also obviously including bugs.

So when people bring up on-disk caching, it really is a totally different 
thing from the kernel delaying writes.

So it's entirely reasonable to say "leave the disk doing write caching, 
and don't force flushing", while still saying "the kernel should order the 
writes it does".

Thinking that this is somehow a black-and-white issue where "ordered 
writes" always has to imply "cache flush commands" is simply wrong. It is 
_not_ that black-and-white, and it should probably not even be a 
filesystem decision to make (it's a "system" decision).

This, btw, is doubly true simply because if the disk really fails, it's 
entirely possible that it fails in a really nasty way. As in "not only did 
it not write the sector, but the whole track is now totally unreadable 
because power failed while the write head was active".

Because that notion of "power" is not a digital thing - you have 
capacitors, brown-outs, and generally nasty "oops, for a few milliseconds 
the drive still had power, but it was way out of spec, and odd things 
happened".

So quite frankly, if you start worrying about disk power failures, you 
should also then worry about the disk failing in _way_ more spectacular 
ways than just the simple "wrote or wrote not - that is the question".

And when was the last time you saw a "safe" logging filesystem that was 
safe in the face of the log returning IO errors after power comes back on?

Sure, RAID is one answer. Except not so much in 99% of all desktops or 
especially laptops.

			Linus


* Re: Linux 2.6.29
  2009-03-30 15:34                                                                                         ` Linus Torvalds
@ 2009-03-30 16:11                                                                                           ` Ric Wheeler
  2009-03-30 16:34                                                                                             ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-30 16:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Mon, 30 Mar 2009, Ric Wheeler wrote:
>> People keep forgetting that storage (even on your commodity s-ata class of
>> drives) has very large & volatile cache. The disk firmware can hold writes in
>> that cache as long as it wants, reorder its writes into anything that makes
>> sense and has no explicit ordering promises.
> 
> Well, when it comes to disk caches, it really does make sense to start 
> looking at what breaks.
> 
> For example, it is obviously true that any half-way modern disk has 
> megabytes of caches, and write caching is quite often enabled by default. 
> 
> BUT!
> 
> The write-caches on disk are rather different in many very fundamental 
> ways from the kernel write caches.
> 
> One of the differences is that no disk I've ever heard of does write- 
> caching for long times, unless it has battery back-up. Yes, yes, you can 
> probably find firmware that has some odd starvation issue, and if the disk 
> is constantly busy and the access patterns are _just_ right the writes can 
> take a long time, but realistically we're talking delaying and re-ordering 
> things by milliseconds. We're not talking seconds or tens of seconds.
> 
> And that's really quite a _big_ difference in itself. It may not be 
> qualitatively all that different (re-ordering is re-ordering, delays are 
> delays), but IN PRACTICE there's an absolutely huge difference between 
> delaying and re-ordering writes over milliseconds and doing so over 30s.
> 
> The other (huge) difference is that the on-disk write caching generally 
> fails only if the drive power fails. Yes, there's a software component to 
> it (buggy firmware), but you can really approximate the whole "disk write 
> caches didn't get flushed" with "powerfail".
> 
> Kernel data caches? Let's be honest. The kernel can fail for a thousand 
> different reasons, including very much _any_ component failing, rather 
> than just the power supply. But also obviously including bugs.
> 
> So when people bring up on-disk caching, it really is a totally different 
> thing from the kernel delaying writes.
> 
> So it's entirely reasonable to say "leave the disk doing write caching, 
> and don't force flushing", while still saying "the kernel should order the 
> writes it does".

Largely correct above - most disks will gradually destage writes from their 
cache. Large, sequential writes might entirely bypass the write cache and be 
sent (more or less) immediately out to permanent storage.

I still disagree strongly with the don't force flush idea - we have an absolute 
and critical need to have ordered writes that will survive a power failure for 
any file system that is built on transactions (or data base).

The big issues are that for s-ata drives, our flush mechanism is really, really 
primitive and brutal. We could/should try to validate a better and less onerous 
mechanism (with ordering tags? experimental flush ranges? etc).

> Thinking that this is somehow a black-and-white issue where "ordered 
> writes" always has to imply "cache flush commands" is simply wrong. It is 
> _not_ that black-and-white, and it should probably not even be a 
> filesystem decision to make (it's a "system" decision).
> 
> This, btw, is doubly true simply because if the disk really fails, it's 
> entirely possible that it fails in a really nasty way. As in "not only did 
> it not write the sector, but the whole track is now totally unreadable 
> because power failed while the write head was active".

I spent a very long time looking at huge numbers of installed systems (millions 
of file systems deployed in the field), including  taking part in weekly 
analysis of why things failed, whether the rates of failure went up or down with 
a given configuration, etc. so I can fully appreciate all of the ways drives (or 
SSD's!) can magically eat your data.

What you have to keep in mind is the order of magnitude of various buckets of 
failures - software crashes/code bugs tend to dominate, followed by drive 
failures, followed by power supplies, etc.

I have personally seen a huge reduction in the "software" rate of failures when 
you get the write barriers (forced write cache flushing) working properly with a 
very large installed base, tested over many years :-)


> 
> Because that notion of "power" is not a digital thing - you have 
> capacitors, brown-outs, and generally nasty "oops, for a few milliseconds 
> the drive still had power, but it was way out of spec, and odd things 
> happened".
> 
> So quite frankly, if you start worrying about disk power failures, you 
> should also then worry about the disk failing in _way_ more spectacular 
> ways than just the simple "wrote or wrote not - that is the question".

Again, you have to focus on the errors that happen in order of their prevalence.

The number of boxes, over a 3 year period, that have an unexpected power loss is 
much, much higher than the number of boxes that have a disk head crash (probably 
the number one cause of hard disk failure).

I do agree that we need to do other (background) tasks to detect things like the 
media errors that drives can have (the drive industry has lots of neat terms that 
give file system people nightmares: "adjacent track erasures", "over powered 
seeks", "hi fly writes" just to name my favourites).

Having full checksumming for data blocks and metadata blocks in btrfs will allow 
us to do this kind of background scrubbing pretty naturally, a big win.

> 
> And when was the last time you saw a "safe" logging filesystem that was 
> safe in the face of the log returning IO errors after power comes back on?

This is pretty much a double failure - you need a bad write to the log (or 
undetected media error like the ones I mentioned above) and a power failure/reboot.

As you say, most file systems or data bases will need manual repair or will get 
restored from tape.

That is not the normal case, but we can do surface level scans to try and weed 
out bad media continually during the healthy phase of a box's life. This can be 
relatively low impact and has a huge positive impact on system reliability.

Any engineer who designs storage systems knows that you will have failures - we 
just aim to get the rate of failures down to where you have a fighting chance of 
recovery at a price you can afford...

> 
> Sure, RAID is one answer. Except not so much in 99% of all desktops or 
> especially laptops.
> 
> 			Linus

If you only have one disk, you clearly need a good back up plan of some kind. I 
try to treat my laptop as a carrying vessel for data that I have temporarily on 
it, but is stored somewhere else more stable for when the disk breaks, some kid 
steals it, etc :-)

Ric



* Re: Linux 2.6.29
  2009-03-30 15:27                                                                                                   ` Ric Wheeler
@ 2009-03-30 16:13                                                                                                     ` Linus Torvalds
  2009-03-30 16:30                                                                                                       ` Mark Lord
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 16:13 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Mark Lord, Chris Mason, Andreas T.Auer, Alan Cox, Theodore Tso,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Mon, 30 Mar 2009, Ric Wheeler wrote:
> 
> A modern S-ATA drive has up to 32MB of write cache. If you lose power or
> suffer a sudden reboot (that can reset the bus at least), I am pretty sure
> that your above assumption is simply not true.

At least traditionally, it's worth noting that 32MB of on-disk cache is 
not the same as 32MB of kernel write cache.

The drive caches tend to be more like track caches - you tend to have a 
few large cache entries (segments), not something like a sector cache. And 
I seriously doubt the disk will let you fill them up with writes: it 
likely has things like the sector remapping tables in those caches too.

It's hard to find information about the cache organization of modern 
drives, but at least a few years ago, some of them literally had just a 
single segment, or just a few segments (ie a "8MB cache" might be eight 
segments of one megabyte each).

The reason that matters is that those disks are very good at linear 
throughput.

The latency for writing out eight big segments is likely not really 
noticeably different from the latency of writing out eight single sectors 
spread out across the disk - they both do eight operations, and the 
difference between an op that writes a big chunk of a track and writing a 
single sector isn't necessarily all that noticeable.

So if you have a 8MB drive cache, it's very likely that the drive can 
flush its cache in just a few seeks, and we're still talking milliseconds. 
In contrast, even just 8MB of OS caches could have _hundreds_ of seeks and 
take several seconds to write out.

			Linus
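
As a back-of-the-envelope check of the argument above, the flush time can be
modelled as one seek per separate extent plus the transfer time. This is a
rough sketch: `struct disk_model`, `flush_ms()`, and the figures used with it
(10 ms per seek, 80 MB/s sequential) are illustrative assumptions, not
measurements of any real drive.

```c
/* All figures fed to this model are illustrative assumptions. */
struct disk_model {
	double seek_ms;		/* average seek plus rotational latency */
	double mb_per_s;	/* sustained sequential transfer rate */
};

/* Time in milliseconds to write `ops` separate extents totalling
 * `total_mb` megabytes, paying one seek per extent. */
static double flush_ms(const struct disk_model *d, int ops, double total_mb)
{
	return ops * d->seek_ms + total_mb / d->mb_per_s * 1000.0;
}
```

With those assumed figures, eight 1MB segments cost about 180 ms
(`flush_ms(&d, 8, 8.0)`), while the same 8MB scattered over 500 extents costs
about 5.1 seconds (`flush_ms(&d, 500, 8.0)`): the seek count, not the byte
count, dominates.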


* Re: Linux 2.6.29
  2009-03-30 16:13                                                                                                     ` Linus Torvalds
@ 2009-03-30 16:30                                                                                                       ` Mark Lord
  2009-03-30 16:58                                                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-30 16:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Chris Mason, Andreas T.Auer, Alan Cox, Theodore Tso,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Mon, 30 Mar 2009, Ric Wheeler wrote:
>> A modern S-ATA drive has up to 32MB of write cache. If you lose power or
>> suffer a sudden reboot (that can reset the bus at least), I am pretty sure
>> that your above assumption is simply not true.
> 
> At least traditionally, it's worth noting that 32MB of on-disk cache is 
> not the same as 32MB of kernel write cache.
> 
> The drive caches tend to be more like track caches - you tend to have a 
> few large cache entries (segments), not something like a sector cache. And 
> I seriously doubt the disk will let you fill them up with writes: it 
> likely has things like the sector remapping tables in those caches too.
..

I spent an entire day recently, trying to see if I could significantly fill
up the 32MB cache on a 750GB Hitachi SATA drive here.

With deliberate/random write patterns, big and small, near and far,
I could not fill the drive with anything approaching a full second
of latent write-cache flush time.

Not even close.  Which is a pity, because I really wanted to do some testing
related to a deep write cache.  But it just wouldn't happen.

I tried this again on a 16MB cache of a Seagate drive, no difference.

Bummer.  :)


* Re: Linux 2.6.29
  2009-03-30 16:11                                                                                           ` Ric Wheeler
@ 2009-03-30 16:34                                                                                             ` Linus Torvalds
  2009-03-30 17:11                                                                                               ` Ric Wheeler
  2009-03-31 21:10                                                                                               ` Alan Cox
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 16:34 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Mon, 30 Mar 2009, Ric Wheeler wrote:
> 
> I still disagree strongly with the don't force flush idea - we have an
> absolute and critical need to have ordered writes that will survive a power
> failure for any file system that is built on transactions (or data base).

Read that sentence of yours again.

In particular, read the "we" part, and ponder.

YOU have that absolute and critical need.

Others? Likely not so much. The reason people run "data=ordered" on their 
laptops is not just because it's the default - rather, it's the default 
_because_ it's the one that avoids most obvious problems. And for 99% of 
all people, that's what they want. 

And as mentioned, if you have to have absolute requirements, you 
absolutely MUST be using real RAID with real protection (not just RAID0). 

Not "should". MUST. If you don't do redundancy, your disk _will_ 
eventually eat your data. Not because the OS wrote in the wrong order, or 
the disk cached writes, but simply because bad things do happen.

But turn that around, and say: if you don't have redundant disks, then 
pretty much by definition those drive flushes won't be guaranteeing your 
data _anyway_, so why pay the price?

> The big issues are that for s-ata drives, our flush mechanism is really,
> really primitive and brutal. We could/should try to validate a better and less
> onerous mechanism (with ordering tags? experimental flush ranges? etc).

That's one of the issues. The cost of those flushes can be really quite 
high, and as mentioned, in the absence of redundancy you don't actually 
get the guarantees that you seem to think that you get.

> I spent a very long time looking at huge numbers of installed systems
> (millions of file systems deployed in the field), including  taking part in
> weekly analysis of why things failed, whether the rates of failure went up or
> down with a given configuration, etc. so I can fully appreciate all of the
> ways drives (or SSD's!) can magically eat your data.

Well, I can go mainly by my own anecdotal evidence, and so far I've 
actually had more catastrophic data failure from failed drives than 
anything else. OS crashes in the middle of a "yum update"? Yup, been 
there, done that, it was really painful. But it was painful in a "damn, I 
need to force a re-install of a couple of rpms".

Actual failed drives that got read errors? I seem to average almost one a 
year. It's been overheating laptops, and it's been power outages that 
apparently happened at really bad times. I have a UPS now.

> What you have to keep in mind is the order of magnitude of various buckets of
> failures - software crashes/code bugs tend to dominate, followed by drive
> failures, followed by power supplies, etc.

Sure. And those "write flushes" really only cover a rather small 
percentage. For many setups, the other corruption issues (drive failure) 
are not just more common, but generally more disastrous anyway. So why 
would a person like that worry about the (rare) power failure?

> I have personally seen a huge reduction in the "software" rate of failures
> when you get the write barriers (forced write cache flushing) working properly
> with a very large installed base, tested over many years :-)

The software rate of failures should only care about the software write 
barriers (ie the ones that order the OS elevator - NOT the ones that 
actually tell the disk to flush itself).

			Linus


* Re: Linux 2.6.29
  2009-03-30 16:30                                                                                                       ` Mark Lord
@ 2009-03-30 16:58                                                                                                         ` Linus Torvalds
  2009-03-30 17:29                                                                                                           ` Mark Lord
  2009-03-30 17:57                                                                                                           ` Chris Mason
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 16:58 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ric Wheeler, Chris Mason, Andreas T.Auer, Alan Cox, Theodore Tso,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Mon, 30 Mar 2009, Mark Lord wrote:
>
> I spent an entire day recently, trying to see if I could significantly fill
> up the 32MB cache on a 750GB Hitachi SATA drive here.
> 
> With deliberate/random write patterns, big and small, near and far,
> I could not fill the drive with anything approaching a full second
> of latent write-cache flush time.
> 
> Not even close.  Which is a pity, because I really wanted to do some testing
> related to a deep write cache.  But it just wouldn't happen.
> 
> I tried this again on a 16MB cache of a Seagate drive, no difference.
> 
> Bummer.  :)

Try it with laptop drives. You might get to a second, or at least hundreds 
of ms (not counting the spinup delay if it went to sleep, obviously). You 
probably tested desktop drives (that 750GB Hitachi one is not a low end 
one, and I assume the Seagate one isn't either).

You'll have a much easier time getting long latencies when seeks take tens 
of ms, and the platter rotates at some pitiful 3600rpm (ok, I guess those 
drives are hard to find these days - I guess 4200rpm is the norm even for 
1.8" laptop hard drives).

And also - this is probably obvious to you, but it might not be 
immediately obvious to everybody - make sure that you do have TCQ going, 
and at full depth. If the drive supports TCQ (and they all do, these days) 
it is quite possible that the drive firmware basically limits the write 
caching to one segment per TCQ entry (or at least to something smallish).

Why? Because that really simplifies some of the problem space for the 
firmware a _lot_ - if you have at least as many segments in your cache as 
your max TCQ depth, it means that you always have one segment free to be 
re-used without any physical IO when a new command comes in.

And if I were a disk firmware engineer, I'd try my damnedest to keep my 
problem space simple, so I would do exactly that kind of "limit the number 
of dirty cache segments by the queue size" thing.

But I dunno. You may not want to touch those slow laptop drives with a 
ten-foot pole. It's certainly not my favorite pastime.

			Linus
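
The "limit the number of dirty cache segments by the queue size" guess can be
written down as a toy model. This is a sketch of the speculated policy only:
`struct seg_cache` and its helpers are hypothetical names, and nothing here
is taken from real drive firmware.

```c
#include <stdbool.h>

/* Toy model of the policy guessed at above; not real drive firmware. */
struct seg_cache {
	int nsegs;		/* total cache segments */
	int queue_depth;	/* maximum outstanding queued commands */
	int dirty;		/* segments holding unwritten data */
};

/* Accept a write into the cache only while the dirty count stays below
 * both the queue depth and the segment count; otherwise the caller has
 * to write a segment out to the platter first. */
static bool cache_accept_write(struct seg_cache *c)
{
	int limit = c->queue_depth < c->nsegs ? c->queue_depth : c->nsegs;

	if (c->dirty >= limit)
		return false;
	c->dirty++;
	return true;
}

/* One dirty segment reached the media. */
static void cache_destage_one(struct seg_cache *c)
{
	if (c->dirty > 0)
		c->dirty--;
}

/* Worst-case flush time if each dirty segment costs one seek. */
static double worst_flush_ms(const struct seg_cache *c, double ms_per_seg)
{
	return c->dirty * ms_per_seg;
}
```

Under this model a drive with 32 segments but a queue depth of 4 never holds
more than 4 dirty segments, so the worst-case flush stays at a handful of
seeks no matter how large the cache is, which would be consistent with the
test results reported above.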


* Re: Linux 2.6.29
  2009-03-30 16:34                                                                                             ` Linus Torvalds
@ 2009-03-30 17:11                                                                                               ` Ric Wheeler
  2009-03-30 17:39                                                                                                 ` Mark Lord
  2009-03-30 17:51                                                                                                 ` Linus Torvalds
  2009-03-31 21:10                                                                                               ` Alan Cox
  1 sibling, 2 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-30 17:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Mon, 30 Mar 2009, Ric Wheeler wrote:
>> I still disagree strongly with the don't force flush idea - we have an
>> absolute and critical need to have ordered writes that will survive a power
>> failure for any file system that is built on transactions (or data base).
> 
> Read that sentence of yours again.
> 
> In particular, read the "we" part, and ponder.
> 
> YOU have that absolute and critical need.
> 
> Others? Likely not so much. The reason people run "data=ordered" on their 
> laptops is not just because it's the default - rather, it's the default 
> _because_ it's the one that avoids most obvious problems. And for 99% of 
> all people, that's what they want. 

My "we" is meant to be the file system writers - we build our journalled file 
systems on top of these assumptions about ordering.  Not having them punts this 
all to fsck running most likely in a manual repair.

> 
> And as mentioned, if you have to have absolute requirements, you 
> absolutely MUST be using real RAID with real protection (not just RAID0). 
> 
> Not "should". MUST. If you don't do redundancy, your disk _will_ 
> eventually eat your data. Not because the OS wrote in the wrong order, or 
> the disk cached writes, but simply because bad things do happen.

Simply not true. To build reliable systems, you need reliable components.

It is perfectly normal to build non-raided systems that are components of a 
larger storage pool that don't do raid.

An easy example would be two desktops using rsync; most "cloud" storage systems do 
something similar at the whole-file level (i.e., write out my file 3 times).

If you acknowledge a write back to a client and then have a power outage, the 
client should reasonably be able to expect that the data survived the power outage.

> 
> But turn that around, and say: if you don't have redundant disks, then 
> pretty much by definition those drive flushes won't be guaranteeing your 
> data _anyway_, so why pay the price?

They do in fact provide that promise for the extremely common case of power 
outage and as such, can be used to build reliable storage if you need to.

>> The big issues are that for s-ata drives, our flush mechanism is really,
>> really primitive and brutal. We could/should try to validate a better and less
>> onerous mechanism (with ordering tags? experimental flush ranges? etc).
> 
> That's one of the issues. The cost of those flushes can be really quite 
> high, and as mentioned, in the absense of redundancy you don't actually 
> get the guarantees that you seem to think that you get.

I have measured the costs of the write flushes on a variety of devices; 
routinely, a cache flush is on the order of 10-20 ms with a healthy s-ata drive.

Compared to the write speed of writing any large file from DRAM to storage, one 
20ms cost to make sure it is on disk is normally in the noise.

The trade off is clearly not as good for small files.
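
To put rough numbers on that trade-off, here is a minimal sketch that computes what fraction of the total time one trailing flush takes. The function name and the 50 MB/s throughput used in the note below are assumptions for illustration, not measurements from this thread; only the 20 ms flush figure comes from the text above.

```c
#include <assert.h>

/*
 * Fraction of total time spent in one trailing cache flush when a file
 * of `file_bytes` is written at `bytes_per_sec` and then flushed once,
 * with the flush taking `flush_sec`. The throughput is an assumed input.
 */
double flush_overhead_fraction(double file_bytes, double bytes_per_sec,
			       double flush_sec)
{
	double transfer_sec;

	assert(file_bytes >= 0.0 && bytes_per_sec > 0.0 && flush_sec > 0.0);
	transfer_sec = file_bytes / bytes_per_sec;
	return flush_sec / (transfer_sec + flush_sec);
}
```

With an assumed 50 MB/s, a 1 GB file takes about 20 s to write, so a 20 ms flush is roughly 0.1% of the total. For a single 4 KB file the transfer itself takes well under a millisecond, so the same flush is more than 99% of the total.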

And I will add, my data is built on years of real data from commodity hardware 
running normal Linux kernels - no special hardware. There are also a lot of good 
papers that the USENIX FAST people have put out (looking at failures in NetApp 
gear, the HPC servers in national labs and at Google) that can help provide 
realistic & accurate data.


> 
>> I spent a very long time looking at huge numbers of installed systems
>> (millions of file systems deployed in the field), including  taking part in
>> weekly analysis of why things failed, whether the rates of failure went up or
>> down with a given configuration, etc. so I can fully appreciate all of the
>> ways drives (or SSD's!) can magically eat your data.
> 
> Well, I can go mainly by my own anecdotal evidence, and so far I've 
> actually had more catastrophic data failure from failed drives than 
> anything else. OS crashes in the middle of a "yum update"? Yup, been 
> there, done that, it was really painful. But it was painful in a "damn, I 
> need to force a re-install of a couple of rpms".
> 
> Actual failed drives that got read errors? I seem to average almost one a 
> year. It's been overheating laptops, and it's been power outages that 
> apparently happened at really bad times. I have a UPS now.

Heat is a major killer of spinning drives (as is severe cold). A lot of times, 
drives that have read errors only (not failed writes) might be fully recoverable 
if you can re-write that injured sector. What you should look for is a peak in 
the remapped sectors (via hdparm) - that usually is a moderately good indicator 
(but note that it is normal to have some, just not 10-25% remapped!).

> 
>> What you have to keep in mind is the order of magnitude of various buckets of
>> failures - software crashes/code bugs tend to dominate, followed by drive
>> failures, followed by power supplies, etc.
> 
> Sure. And those "write flushes" really only cover a rather small 
> percentage. For many setups, the other corruption issues (drive failure) 
> are not just more common, but generally more disastrous anyway. So why 
> would a person like that worry about the (rare) power failure?

This is simply not a true statement from what I have seen personally.

> 
>> I have personally seen a huge reduction in the "software" rate of failures
>> when you get the write barriers (forced write cache flushing) working properly
>> with a very large installed base, tested over many years :-)
> 
> The software rate of failures should only care about the software write 
> barriers (ie the ones that order the OS elevator - NOT the ones that 
> actually tell the disk to flush itself).
> 
> 			Linus


The elevator does not issue write barriers on its own - those write barriers are 
sent down by the file systems for transaction commits.

I could be totally confused at this point, but I don't know of any sequential 
ordering needs that CFQ, etc have for their internal needs.

ric




* Re: Linux 2.6.29
  2009-03-30 16:58                                                                                                         ` Linus Torvalds
@ 2009-03-30 17:29                                                                                                           ` Mark Lord
  2009-03-30 17:57                                                                                                           ` Chris Mason
  1 sibling, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-30 17:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Chris Mason, Andreas T.Auer, Alan Cox, Theodore Tso,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Mon, 30 Mar 2009, Mark Lord wrote:
>> I spent an entire day recently, trying to see if I could significantly fill
>> up the 32MB cache on a 750GB Hitach SATA drive here.
>>
>> With deliberate/random write patterns, big and small, near and far,
>> I could not fill the drive with anything approaching a full second
>> of latent write-cache flush time.
>>
>> Not even close.  Which is a pity, because I really wanted to do some testing
>> related to a deep write cache.  But it just wouldn't happen.
>>
>> I tried this again on a 16MB cache of a Seagate drive, no difference.
>>
>> Bummer.  :)
> 
> Try it with laptop drives. You might get to a second, or at least hundreds 
> of ms (not counting the spinup delay if it went to sleep, obviously). You 
> probably tested desktop drives (that 750GB Hitachi one is not a low end 
> one, and I assume the Seagate one isn't either).
> 
> You'll have a much easier time getting long latencies when seeks take tens 
> of ms, and the platter rotates at some pitiful 3600rpm (ok, I guess those 
> drives are hard to find these days - I guess 4200rpm is the norm even for 
> 1.8" laptop harddrives).
> 
> And also - this is probably obvious to you, but it might not be 
> immediately obvious to everybody - make sure that you do have TCQ going, 
> and at full depth. If the drive supports TCQ (and they all do, these days) 
> it is quite possible that the drive firmware basically limits the write 
> caching to one segment per TCQ entry (or at least to something smallish).
..

Oh yes, absolutely -- I tried with and without NCQ (the SATA replacement
for old-style TCQ), and with varying NCQ queue depths.  No luck keeping
the darned thing busy flushing afterwards for anything more than
perhaps a few hundred milliseconds.  I wasn't really interested in anything
under a second, so I didn't measure it exactly though.

The older and/or slower notebook drives (4200rpm) tend to have smaller
onboard caches, too.  Which makes them difficult to fill.

I suspect I'd have much better "luck" with a slow-ish SSD that has
a largish write cache.  Dunno if those exist, and they'll have to get
cheaper before I pick one up to deliberately bash on.  :)

Cheers


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 12:09                                                   ` [PATCH 1/7] block: Add block_flush_device() Fernando Luis Vázquez Cao
  2009-03-30 15:07                                                     ` Bartlomiej Zolnierkiewicz
@ 2009-03-30 17:34                                                     ` Linus Torvalds
  2009-03-30 17:50                                                       ` Jeff Garzik
  2009-03-30 17:55                                                       ` Jens Axboe
  1 sibling, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 17:34 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao, Jens Axboe
  Cc: Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj



On Mon, 30 Mar 2009, Fernando Luis Vázquez Cao wrote:
> +	int ret = 0;
> +
> +	ret = blkdev_issue_flush(bdev, NULL);
> +
> +	return (ret == -EOPNOTSUPP) ? 0 : ret;

Btw, why do we do that silly EOPNOTSUPP at all?

If the device doesn't support flushing, we should

 - set a flag in the device saying so, and not ever try to flush again on 
   that device (who knows how long it took for the device to say "I can't 
   do this"? We don't want to keep on doing it)

 - return "done". There's nothing sane the caller can do with the error 
   code anyway, it just has to assume that the device basically doesn't 
   reorder writes.

So wouldn't it be better to just fix blkdev_issue_flush() to not do those 
crazy error codes?

[ The same thing probably goes for those ENXIO errors, btw. If we don't 
  have a bd_disk or a queue, why would the caller care about it? ]

Jens?

			Linus
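
The two steps proposed above (remember that the device cannot flush, then report "done") can be sketched in plain userspace C. This is only an illustration of the idea, not kernel code: `struct fake_dev`, `fake_flush()` and `unsupported_flush()` are hypothetical names, and nothing here is the real blkdev_issue_flush() interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block device; not a kernel structure. */
struct fake_dev {
	bool no_flush;                          /* set once a flush was refused */
	int (*issue_flush)(struct fake_dev *);  /* hypothetical low-level hook */
	int flush_attempts;                     /* times the hook was called */
};

/*
 * Try to flush. If the device answers -EOPNOTSUPP, remember that and
 * report success; later calls return immediately without asking again.
 * Any other error is passed back to the caller unchanged.
 */
int fake_flush(struct fake_dev *dev)
{
	int ret;

	assert(dev != NULL && dev->issue_flush != NULL);
	if (dev->no_flush)
		return 0;

	dev->flush_attempts++;
	ret = dev->issue_flush(dev);
	if (ret == -EOPNOTSUPP) {
		dev->no_flush = true;
		return 0;
	}
	return ret;
}

/* Example hook for a device that never supports flushing. */
int unsupported_flush(struct fake_dev *dev)
{
	(void)dev;
	return -EOPNOTSUPP;
}
```

Calling fake_flush() twice on a device whose hook is unsupported_flush() returns 0 both times and invokes the hook only once.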


* Re: Linux 2.6.29
  2009-03-30 17:11                                                                                               ` Ric Wheeler
@ 2009-03-30 17:39                                                                                                 ` Mark Lord
  2009-03-30 17:51                                                                                                 ` Linus Torvalds
  1 sibling, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-30 17:39 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Linus Torvalds, Andreas T.Auer, Alan Cox, Theodore Tso,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Ric Wheeler wrote:
> Linus Torvalds wrote:
..
>> That's one of the issues. The cost of those flushes can be really 
>> quite high, and as mentioned, in the absense of redundancy you don't 
>> actually get the guarantees that you seem to think that you get.
> 
> I have measured the costs of the write flushes on a variety of devices, 
> routinely, a cache flush is on the order of 10-20 ms with a healthy 
> s-ata drive.
..

Err, no.  Yes, the flush itself will be very quick,
since the drive is nearly always keeping up with the I/O
already (as we are discussing in a separate subthread here!).

But.. the cost of that FLUSH_CACHE command can be quite significant.
To issue it, we first have to stop accepting R/W requests,
and then wait for up to 32 of them currently in-flight to complete.
Then issue the cache-flush, and wait for that to complete.

Then resume R/W again.
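
As a back-of-the-envelope model of that sequence, the time new I/O is held off is roughly the time to drain the in-flight commands plus the flush itself. The sketch below is a deliberately crude bound; the function name and every number fed into it are assumptions, and it pretends the queued commands complete one after another.

```c
#include <assert.h>

/*
 * Rough bound on how long new I/O is held off by one cache flush.
 * All parameters are assumptions chosen for illustration:
 *   in_flight      - commands still queued when the flush is requested
 *   per_request_ms - time to complete one queued command
 *   flush_ms       - time the flush command itself takes
 */
double flush_stall_ms(int in_flight, double per_request_ms, double flush_ms)
{
	assert(in_flight >= 0 && per_request_ms >= 0.0 && flush_ms >= 0.0);
	return in_flight * per_request_ms + flush_ms;
}
```

With an idle queue the stall is just the flush time. With a full queue of 32 commands at a hypothetical 8 ms each and a 15 ms flush, draining the queue accounts for most of the stall.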

And FLUSH_CACHE is a PIO command for most libata hosts,
so it has a multi-microsecond CPU hit as well as the I/O hit,
whereas regular R/W commands will usually use less CPU because
they are usually done via an automated host command queue.

Tiny, but significant.  And more so on smaller/slower end-user systems
like netbooks than on datacenter servers, perhaps.

Cheers


* Re: Linux 2.6.29
  2009-03-30 12:55                                                                               ` Chris Mason
@ 2009-03-30 17:42                                                                                 ` Theodore Tso
  2009-03-31 23:55                                                                                 ` Dave Chinner
  1 sibling, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-30 17:42 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dave Chinner, Mark Lord, Stefan Richter, Jeff Garzik,
	Linus Torvalds, Matthew Garrett, Alan Cox, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Mon, Mar 30, 2009 at 08:55:51AM -0400, Chris Mason wrote:
> Sorry, I'm afraid that rsync falls into the same category as the
> kde/gnome apps here.
> 
> There are a lot of backup programs built around rsync, and every one of
> them risks losing the old copy of the file by renaming an unflushed new
> copy over it.
> 
> rsync needs the flushing about a million times more than gnome and kde,
> and it doesn't have any option to do it automatically.  It does have the
> option to create backups, which is how a percentage of people are using
> it, but I wouldn't call its current setup safe outside of ext3.

I wouldn't make it the default, but as an option, if the backup
script would take responsibility for restarting rsync if the server
crashes, and if the rsync process executes a global sync(2) call when
it is complete, an option to make rsync delete the target file before
doing the rename to defeat the replace-via-rename heuristic could be
justifiable.
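
For comparison, the per-file flushing that the quoted text says rsync lacks looks roughly like the sketch below: write the new copy under a scratch name, fsync() it, and only then rename() it over the old file. This is not how rsync is implemented; `replace_file_flushed()` and its `tmp_path` parameter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `data`, flushing the new copy
 * before the rename. `tmp_path` is a caller-chosen scratch name on the
 * same filesystem (hypothetical parameter, for illustration).
 * Returns 0 on success, -1 on any error.
 */
int replace_file_flushed(const char *path, const char *tmp_path,
			 const void *data, size_t len)
{
	const char *p = data;
	int fd;

	assert(path != NULL && tmp_path != NULL);
	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n <= 0)
			goto fail;
		p += n;
		len -= (size_t)n;
	}

	/* Flush the new data before it is allowed to replace the old copy. */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp_path);
		return -1;
	}

	/* rename(2) switches the name over to the already-flushed copy. */
	if (rename(tmp_path, path) < 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp_path);
	return -1;
}
```

The sketch does not fsync the containing directory, which a stricter version would also do, and it pays one flush per file, which is exactly the cost being debated in this thread.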

						- Ted


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 17:34                                                     ` Linus Torvalds
@ 2009-03-30 17:50                                                       ` Jeff Garzik
  2009-03-30 17:55                                                       ` Jens Axboe
  1 sibling, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fernando Luis Vázquez Cao, Jens Axboe, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj

Linus Torvalds wrote:
> So wouldn't it be better to just fix blkdev_issue_flush() to not do those 
> crazy error codes?

Yes, much.

	Jeff




* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 15:14                                                     ` Bartlomiej Zolnierkiewicz
@ 2009-03-30 17:51                                                       ` Jens Axboe
  2009-03-30 17:55                                                         ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-30 17:51 UTC (permalink / raw)
  To: Bartlomiej Zolnierkiewicz
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

On Mon, Mar 30 2009, Bartlomiej Zolnierkiewicz wrote:
> On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
> > Add a sysfs knob to disable storage device writeback cache flushes.
> 
> The horde of casual desktop users (with me included) would probably prefer
> having two settings -- one for filesystem barriers and one for fsync().
> 
> IOW I prefer higher performance at the cost of risking losing few last
> seconds/minutes of work in case of crash / powerfailure but I would still
> like to have the filesystem in the consistent state after such accident.

The knob is meant to control whether we really need to send a flush to
the device or not, so it's an orthogonal issue to what you are talking
about. For battery backed caches, we never need to flush. This knob is
useful IFF we have devices with write back caches that STILL do a cache
flush.

As such, I'd also prefer waiting with adding such a knob until such a
device has actually been observed. No point in adding something just in
case it may exist. And even then, it's probably even better handled in
the driver.

-- 
Jens Axboe



* Re: Linux 2.6.29
  2009-03-30 17:11                                                                                               ` Ric Wheeler
  2009-03-30 17:39                                                                                                 ` Mark Lord
@ 2009-03-30 17:51                                                                                                 ` Linus Torvalds
  2009-03-30 18:15                                                                                                   ` Ric Wheeler
                                                                                                                     ` (2 more replies)
  1 sibling, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 17:51 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Mon, 30 Mar 2009, Ric Wheeler wrote:
> >
> > But turn that around, and say: if you don't have redundant disks, then
> > pretty much by definition those drive flushes won't be guaranteeing your
> > data _anyway_, so why pay the price?
> 
> They do in fact provide that promise for the extremely common case of power
> outage and as such, can be used to build reliable storage if you need to.

No they really effectively don't. Not if the end result is "oops, the 
whole track is now unreadable" (regardless of whether it happened due to a 
write during power-out or during some entirely unrelated disk error). Your 
"flush" didn't result in a stable filesystem at all, it just resulted in a 
dead one.

That's my point. Disks simply aren't that reliable. Anything you do with 
flushing and ordering won't make them magically not have errors any more.

> Heat is a major killer of spinning drives (as is severe cold). A lot of times,
> drives that have read errors only (not failed writes) might be fully
> recoverable if you can re-write that injured sector.

It's not worked for me, and yes, I've tried. Maybe I've been unlucky, but 
every single case I can remember of having read failures, that drive has 
been dead. Trying to re-write just the sectors with the error (and around 
it) didn't do squat, and rewriting the whole disk didn't work either.

I'm sure it works for some "ok, the write just failed to take, and the CRC 
was bad" case, but that's apparently not what I've had. I suspect either 
the track markers got overwritten (and maybe a disk-specific low-level 
reformat would have helped, but at that point I was not going to trust the 
drive anyway, so I didn't care), or there was actual major physical damage 
due to heat and/or head crash and remapping was just not able to cope.

> > Sure. And those "write flushes" really only cover a rather small percentage.
> > For many setups, the other corruption issues (drive failure) are not just
> > more common, but generally more disastrous anyway. So why would a person
> > like that worry about the (rare) power failure?
> 
> This is simply not a true statement from what I have seen personally.

You yourself said that software errors were your biggest issue. The write 
flush wouldn't matter for those (but the elevator barrier would)

> The elevator does not issue write barriers on its own - those write barriers
> are sent down by the file systems for transaction commits.

Right. But "elevator write barrier" vs "sending a drive flush command" are 
two totally independent issues. You can do one without the other (although 
doing a drive flush command without the write barrier is admittedly kind 
of pointless ;^)

And my point is, IT MAKES SENSE to just do the elevator barrier, _without_ 
the drive command. If you worry much more about software (or non-disk 
component) failure than about power failures, you're better off just doing 
the software-level synchronization, and leaving the hardware alone.

			Linus


* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 17:51                                                       ` Jens Axboe
@ 2009-03-30 17:55                                                         ` Jeff Garzik
  2009-03-30 17:59                                                           ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 17:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bartlomiej Zolnierkiewicz, Fernando Luis Vázquez Cao,
	Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

Jens Axboe wrote:
> On Mon, Mar 30 2009, Bartlomiej Zolnierkiewicz wrote:
>> On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
>>> Add a sysfs knob to disable storage device writeback cache flushes.
>> The horde of casual desktop users (with me included) would probably prefer
>> having two settings -- one for filesystem barriers and one for fsync().
>>
>> IOW I prefer higher performance at the cost of risking losing few last
>> seconds/minutes of work in case of crash / powerfailure but I would still
>> like to have the filesystem in the consistent state after such accident.
> 
> The knob is meant to control whether we really need to send a flush to
> the device or not, so it's an orthogonal issue to what you are talking
> about. For battery backed caches, we never need to flush. This knob is
> useful IFF we have devices with write back caches that STILL do a cache
> flush.

How do installers and/or kernels detect a battery-backed cache that does 
not need flush?

	Jeff





* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 17:34                                                     ` Linus Torvalds
  2009-03-30 17:50                                                       ` Jeff Garzik
@ 2009-03-30 17:55                                                       ` Jens Axboe
  2009-03-30 18:27                                                         ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-30 17:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj

On Mon, Mar 30 2009, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Mar 2009, Fernando Luis Vázquez Cao wrote:
> > +	int ret = 0;
> > +
> > +	ret = blkdev_issue_flush(bdev, NULL);
> > +
> > +	return (ret == -EOPNOTSUPP) ? 0 : ret;
> 
> Btw, why do we do that silly EOPNOTSUPP at all?
> 
> If the device doesn't support flushing, we should
> 
>  - set a flag in the device saying so, and not ever try to flush again on
>    that device (who knows how long it took for the device to say "I can't
>    do this"? We don't want to keep on doing it)
> 
>  - return "done". There's nothing sane the caller can do with the error
>    code anyway, it just has to assume that the device basically doesn't
>    reorder writes.
> 
> So wouldn't it be better to just fix blkdev_issue_flush() to not do those
> crazy error codes?

The problem is that we may not know upfront, so it sort-of has to be
this trial approach where the first barrier issued will notice and fail
with -EOPNOTSUPP. Sure, we could cache this value, but it's pretty
pointless since the filesystem will stop sending barriers in this case.
As it also modifies fs behaviour, we need to pass the info back.

For blkdev_issue_flush() it may not be very interesting, since there's
not much we can do about that. Just seems like very bad style to NOT
return an error in such a case. You can assume that ordering is fine,
but it definitely won't be in all cases (eg devices that have write back
caching on by default and don't support flush). So the nice thing to do
there is actually tell the caller about it. So the same error is reused
as we do for actual write barriers that have data attached.

> [ The same thing probably goes for those ENXIO errors, btw. If we don't
>   have a bd_disk or a queue, why would the caller care about it? ]

Right, that is pretty pointless.

-- 
Jens Axboe



* Re: Linux 2.6.29
  2009-03-30 16:58                                                                                                         ` Linus Torvalds
  2009-03-30 17:29                                                                                                           ` Mark Lord
@ 2009-03-30 17:57                                                                                                           ` Chris Mason
  2009-03-30 18:39                                                                                                             ` Mark Lord
  2009-03-30 18:54                                                                                                             ` Pasi Kärkkäinen
  1 sibling, 2 replies; 664+ messages in thread
From: Chris Mason @ 2009-03-30 17:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Lord, Ric Wheeler, Andreas T.Auer, Alan Cox, Theodore Tso,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 3440 bytes --]

On Mon, 2009-03-30 at 09:58 -0700, Linus Torvalds wrote:
> 
> On Mon, 30 Mar 2009, Mark Lord wrote:
> >
> > I spent an entire day recently, trying to see if I could significantly fill
> > up the 32MB cache on a 750GB Hitach SATA drive here.
> > 
> > With deliberate/random write patterns, big and small, near and far,
> > I could not fill the drive with anything approaching a full second
> > of latent write-cache flush time.
> > 
> > Not even close.  Which is a pity, because I really wanted to do some testing
> > related to a deep write cache.  But it just wouldn't happen.
> > 
> > I tried this again on a 16MB cache of a Seagate drive, no difference.
> > 
> > Bummer.  :)
> 
> Try it with laptop drives. You might get to a second, or at least hundreds 
> of ms (not counting the spinup delay if it went to sleep, obviously). You 
> probably tested desktop drives (that 750GB Hitachi one is not a low end 
> one, and I assume the Seagate one isn't either).

I had some fun trying things with this, and I've been able to reliably
trigger stalls in write cache of ~60 seconds on my seagate 500GB sata
drive.  The worst I saw was 214 seconds.

It took a little experimentation, and I had to switch to the noop
scheduler (no idea why).  

Also, I had to watch vmstat closely.  When the test first started,
vmstat was reporting 500kb/s or so write throughput.  After the test ran
for a few minutes, vmstat jumped up to 8MB/s.

My guess is that the drive has some internal threshold for when it
decides to only write in cache.  The switch to 8MB/s is when it switched
to cache only goodness.  Or perhaps the attached program is buggy and
I'll end up looking silly...it was some quick coding.

The test forks two procs.  One proc does 4k writes to the first 26MB of
the test file (/dev/sdb for me).  These writes are O_DIRECT, and use a
block size of 4k.

The idea is that we fill the cache with work that is very beneficial to
keep in cache, but that the drive will tend to flush out because it is
filling up tracks.

The second proc O_DIRECT writes to two adjacent sectors far away from
the hot writes from the first proc, and it puts in a timestamp from just
before the write.  Every second or so, this timestamp is printed to
stderr.  The drive will want to keep these two sectors in cache because
we are constantly overwriting them.

(It's worth mentioning this is a destructive test.  Running it
on /dev/sdb will overwrite the first 64MB of the drive!!!!)

Sample output:

# ./wb-latency /dev/sdb
Found tv 1238434622.461527
starting hot writes run
starting tester run
current time 1238435045.529751
current time 1238435046.531250
...
current time 1238435063.772456
current time 1238435064.788639
current time 1238435065.814101
current time 1238435066.847704

Right here, I pull the power cord.  The box comes back up, and I run:

# ./wb-latency -c /dev/sdb
Found tv 1238435067.347829

When -c is passed, it just reads the timestamp out of the timestamp
block and exits.  You compare this value with the value printed just
before you pulled the plug.

For the run here, the two values are within .5s of each other.  The
tester only prints the time every one second, so anything that close is
very good.  I had pulled the plug before the drive got into that fast
8MB/s mode, so the drive was doing a pretty good job of fairly servicing
the cache.

My drive has a cache of 32MB.  Smaller caches probably need a smaller
hot zone.

-chris


[-- Attachment #2: wb-latency.c --]
[-- Type: text/x-csrc, Size: 4378 bytes --]

/*
 * wb-latency.c
 *
 * This file may be redistributed under the terms of the GNU Public
 * License, version 2.
 */
#define _FILE_OFFSET_BITS 64
#define _XOPEN_SOURCE 600
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <signal.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>

#ifndef O_DIRECT
#define O_DIRECT         040000 /* direct disk access hint */
#endif

static int page_size = 4096;

static float timeval_subtract(struct timeval *tv1, struct timeval *tv2)
{
	return ((tv1->tv_sec - tv2->tv_sec) +
		((float) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
}

/*
 * the magic offset is where we write our timestamps.
 * The idea is that we write constantly to the magic offset
 * and then pull the power.
 * After the OS comes back, we read the timestamp stored and compare
 * it with the time stamp printed.  Any difference over 1s is time the
 * IO spent stalled in cache.
 */
static loff_t magic_offset(loff_t total)
{
	loff_t cur = total - ((loff_t)64) * 1024;
	cur = cur / page_size;
	cur = cur * page_size;
	return cur;
}

/*
 * this function runs in a loop overwriting two nearby
 * sectors.  The idea is to create something the
 * drive is likely to store in cache and not send down very often.
 *
 * It writes a timestamp to the sector and to stderr.  After
 * crashing, compare the output of wb-latency -c with the last
 * thing printed on stderr.
 */
static void timestamp_io(int fd, char *buf, loff_t total)
{
	loff_t cur = magic_offset(total);
	struct timeval tv;
	struct timeval print_tv;
	int ret;

	cur = cur / page_size;
	cur = cur * page_size;

	printf("starting tester run\n");
	gettimeofday(&print_tv, NULL);
	while(1) {
		gettimeofday(&tv, NULL);
		memcpy(buf, &tv, sizeof(tv));

		if (timeval_subtract(&tv, &print_tv) >= 1) {
			/* zero-pad usec so 1234 usec prints as .001234, not .1234 */
			fprintf(stderr, "current time %ld.%06ld\n",
				(long)tv.tv_sec, (long)tv.tv_usec);
			gettimeofday(&print_tv, NULL);
		}

		ret = pwrite(fd, buf, page_size, cur);
		if (ret < page_size) {
			fprintf(stderr, "short write ret %d cur %llu\n",
				ret, (unsigned long long)cur);
			exit(1);
		}

		ret = pwrite(fd, buf, page_size, cur + page_size * 2);
		if (ret < page_size) {
			fprintf(stderr, "short write ret %d cur %llu\n",
				ret, (unsigned long long)(cur + page_size * 2));
			exit(1);
		}

	}
}

/*
 * just print out the timestamp in our magic sector
 */
static void check_timestamp_io(int fd, char *buf, loff_t total)
{
	int ret;
	struct timeval tv;
	loff_t cur = magic_offset(total);

	ret = pread(fd, buf, page_size, cur);
	if (ret < page_size) {
		perror("read");
		exit(1);
	}
	memcpy(&tv, buf, sizeof(tv));
	printf("Found tv %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
}

int main(int argc, char **argv)
{
	int	fd;
	struct stat st;
	pid_t pid;
	int ret;
	int i;
	int status;
	loff_t total_size = 128 * 1024 * 1024;
	loff_t hot_size = 26 * 1024 * 1024;
	loff_t cur;
	char *buf;
	char *filename = NULL;
	int check_only = 0;

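	ret = posix_memalign((void **)&buf, page_size, page_size);
	if (ret) {
		/* posix_memalign() returns the error code, it does not set errno */
		fprintf(stderr, "posix_memalign: %s\n", strerror(ret));
		exit(1);
	}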

	memset(buf, 0, page_size);

	if (argc < 2) {
		fprintf(stderr, "usage: wb-latency [-c] file\n");
		exit(1);
	}
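	for (i = 1; i < argc; i++) {
		if (strcmp(argv[i], "-c") == 0)
			check_only = 1;
		else
			filename = argv[i];
	}
	/* "wb-latency -c" alone would otherwise pass NULL to open() */
	if (!filename) {
		fprintf(stderr, "usage: wb-latency [-c] file\n");
		exit(1);
	}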

	/* O_CREAT requires a mode argument */
	fd = open(filename, O_RDWR | O_DIRECT | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	ret = fstat(fd, &st);
	if (ret < 0) {
		perror("fstat");
		exit(1);
	}

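	if (check_only) {
		check_timestamp_io(fd, buf, total_size);
		exit(0);
	}

	/* setup the file if we aren't doing a block device */
	if (!S_ISBLK(st.st_mode) && st.st_size < total_size) {
		printf("setting up file %s\n", filename);
		cur = 0;	/* was uninitialized; the fd offset is still 0 here */
		while(cur < total_size) {
			ret = write(fd, buf, page_size);
			if (ret <= 0) {
				fprintf(stderr, "short write\n");
				exit(1);
			}
			cur += ret;
		}
		printf("done setting up %s\n", filename);
	}

	/*
	 * show what the previous run left behind.  This has to come after
	 * the setup above, otherwise the read fails on a brand new file.
	 */
	check_timestamp_io(fd, buf, total_size);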

	pid = fork();
	if (pid == 0) {
		timestamp_io(fd, buf, total_size);
		exit(0);
	}
	waitpid(pid, &status, WNOHANG);

	/*
	 * here we run the hot IO.  This is something the drive isn't
	 * going to bypass the cache on, but something the drive will
	 * tend to allow to dominate the cache.
	 */
	printf("starting hot writes run\n");
	cur = 0;
	while(1) {
		pwrite(fd, buf, page_size, cur);
		cur += page_size;
		if (cur > hot_size)
			cur = 0;
	}

	return 0;
}


* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 17:55                                                         ` Jeff Garzik
@ 2009-03-30 17:59                                                           ` Jens Axboe
  2009-03-30 19:09                                                             ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-30 17:59 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Bartlomiej Zolnierkiewicz, Fernando Luis Vázquez Cao,
	Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

On Mon, Mar 30 2009, Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Mon, Mar 30 2009, Bartlomiej Zolnierkiewicz wrote:
>>> On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
>>>> Add a sysfs knob to disable storage device writeback cache flushes.
>>> The horde of casual desktop users (with me included) would probably prefer
>>> having two settings -- one for filesystem barriers and one for fsync().
>>>
>>> IOW I prefer higher performance at the cost of risking losing few last
>>> seconds/minutes of work in case of crash / powerfailure but I would still
>>> like to have the filesystem in the consistent state after such accident.
>>
>> The knob is meant to control whether we really need to send a flush to
>> the device or not, so it's an orthogonal issue to what you are talking
>> about. For battery backed caches, we never need to flush. This knob is
>> useful IFF we have devices with write back caches that STILL do a cache
>> flush.
>
> How do installers and/or kernels detect a battery-backed cache that does
> not need flush?

They obviously can't, otherwise it would not be an issue at all. And
whether it's an issue is up for debate, until someone can point at such
a device. You could add a white/blacklist.

So either that knob has to be turned by an administrator (yeah...), or
the in-kernel info would have to be updated. Or a udev rule.

-- 
Jens Axboe



* Re: Linux 2.6.29
  2009-03-30 17:51                                                                                                 ` Linus Torvalds
@ 2009-03-30 18:15                                                                                                   ` Ric Wheeler
  2009-03-30 19:08                                                                                                   ` Eric Sandeen
  2009-03-30 19:22                                                                                                   ` Rik van Riel
  2 siblings, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-30 18:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Mon, 30 Mar 2009, Ric Wheeler wrote:
>>> But turn that around, and say: if you don't have redundant disks, then
>>> pretty much by definition those drive flushes won't be guaranteeing your
>>> data _anyway_, so why pay the price?
>> They do in fact provide that promise for the extremely common case of power
>> outage and as such, can be used to build reliable storage if you need to.
> 
> No they really effectively don't. Not if the end result is "oops, the 
> whole track is now unreadable" (regardless of whether it happened due to a 
> write during power-out or during some entirely unrelated disk error). Your 
> "flush" didn't result in a stable filesystem at all, it just resulted in a 
> dead one.
> 
> That's my point. Disks simply aren't that reliable. Anything you do with 
> flushing and ordering won't make them magically not have errors any more.


They actually are reliable in this way, I have not seen disks fail as you seem 
to think that they do after a simple power failure. With barriers (and barrier 
flushes enabled), you don't get that kind of bad reads for tracks after a normal 
power outage.

Some of the odd cases come from hot spotting of drives (say, rewriting the same 
sector over and over again) which can over many, many writes impact the 
integrity of the adjacent tracks. Or, you can get IO errors from temporary 
vibration (dropped the laptop or rolled a new machine down the data center). 
Those temporary errors are the ones that can be repaired.

I don't know how else to convince you (lots of good wine? beer? :-)), but I have 
personally looked at this in depth. Certainly, "Trust me, I know disks" is not 
really an argument that you have to buy...


> 
>> Heat is a major killer of spinning drives (as is severe cold). A lot of times,
>> drives that have read errors only (not failed writes) might be fully
>> recoverable if you can re-write that injured sector.
> 
> It's not worked for me, and yes, I've tried. Maybe I've been unlucky, but 
> every single case I can remember of having read failures, that drive has 
> been dead. Trying to re-write just the sectors with the error (and around 
> it) didn't do squat, and rewriting the whole disk didn't work either.

Laptop drives are more likely to fail hard - you might have really just had a 
bad head or similar issue.

Mark Lord hacked in support for doing low level writes into hdparm - might be 
worth playing with that next time you get a dud disk.

> 
> I'm sure it works for some "ok, the write just failed to take, and the CRC 
> was bad" case, but that's apparently not what I've had. I suspect either 
> the track markers got overwritten (and maybe a disk-specific low-level 
> reformat would have helped, but at that point I was not going to trust the 
> drive anyway, so I didn't care), or there was actual major physical damage 
> due to heat and/or head crash and remapping was just not able to cope.
> 
>>> Sure. And those "write flushes" really only cover a rather small percentage.
>>> For many setups, the other corruption issues (drive failure) are not just
>>> more common, but generally more disastrous anyway. So why would a person
>>> like that worry about the (rare) power failure?
>> This is simply not a true statement from what I have seen personally.
> 
> You yourself said that software errors were your biggest issue. The write 
> flush wouldn't matter for those (but the elevator barrier would)

How you bucket software issues in a hardware company (old job, not here at Red 
Hat) would include things like "file system corrupt, but disk hardware good" 
which results from improper barrier configuration.

A disk hardware failure would be something like the drive does not spin up, it 
has bad memory in the write cache, a broken head (actually, one of the most 
common errors). Those usually would result in the drive failing to mount.


> 
>> The elevator does not issue write barriers on its own - those write barriers
>> are sent down by the file systems for transaction commits.
> 
> Right. But "elevator write barrier" vs "sending a drive flush command" are 
> two totally independent issues. You can do one without the other (although 
> doing a drive flush command without the write barrier is admittedly kind 
> of pointless ;^)
> 
> And my point is, IT MAKES SENSE to just do the elevator barrier, _without_ 
> the drive command. If you worry much more about software (or non-disk 
> component) failure than about power failures, you're better off just doing 
> the software-level synchronization, and leaving the hardware alone.
> 
> 			Linus

I guess we have to agree to disagree.

File systems need ordering for transactions and recoverability. Doing barriers 
just in the elevator will appear to work well for casual users, but in any given 
large population (including desktops here), will produce more corrupted file 
systems, manual recoveries after power failure, etc.
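
As a rough userspace illustration of the ordering a transaction commit relies
on (a sketch only: commit_in_order and its parameters are made up here, and
whether fdatasync() really reaches the platter depends on the kernel, the
file system and the barrier settings under discussion):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: the journal record must be stable before the
 * commit block is written, and the commit block must be stable before
 * the caller is told the transaction is done.
 * Returns 0 on success, -1 on any short write or sync failure.
 */
static int commit_in_order(int fd,
			   const void *rec, size_t rec_len, off_t rec_off,
			   const void *commit, size_t commit_len,
			   off_t commit_off)
{
	assert(rec && commit);

	/* step 1: journal record */
	if (pwrite(fd, rec, rec_len, rec_off) != (ssize_t)rec_len)
		return -1;
	if (fdatasync(fd) < 0)		/* ordering point */
		return -1;

	/* step 2: commit block, only after step 1 is on stable storage */
	if (pwrite(fd, commit, commit_len, commit_off) != (ssize_t)commit_len)
		return -1;
	if (fdatasync(fd) < 0)		/* durability point */
		return -1;

	return 0;
}
```

If the first sync does not actually order the two writes on the device, the
commit block can land without the record it describes, and that is the kind
of corruption that shows up after a power failure.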

File systems people can work harder to reduce fsync latency, but getting rid of 
these fundamental building blocks is not really a good plan in my opinion. I am 
pretty sure that we can get a safe and high performing file system balance here 
that will not seem as bad as you have experienced.

Ric






* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 17:55                                                       ` Jens Axboe
@ 2009-03-30 18:27                                                         ` Linus Torvalds
  2009-03-30 18:54                                                           ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 18:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj



On Mon, 30 Mar 2009, Jens Axboe wrote:
> 
> The problem is that we may not know upfront, so it sort-of has to be
> this trial approach where the first barrier issued will notice and fail
> with -EOPNOTSUPP.

Well, absolutely. Except I don't think you should use ENOTSUPP, you should 
just set a bit in the "struct request_queue", and then return 0.

IOW, something like this

	--- a/block/blk-barrier.c
	+++ b/block/blk-barrier.c
	@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
	 	if (!q)
	 		return -ENXIO;
	 
	+	if (is_queue_noflush(q))
	+		return 0;
	+
	 	bio = bio_alloc(GFP_KERNEL, 0);
	 	if (!bio)
	 		return -ENOMEM;
	@@ -339,7 +342,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
	 
	 	ret = 0;
	 	if (bio_flagged(bio, BIO_EOPNOTSUPP))
	-		ret = -EOPNOTSUPP;
	+		set_queue_noflush(q);
	 	else if (!bio_flagged(bio, BIO_UPTODATE))
	 		ret = -EIO;
	 

which just returns 0 if we don't support flushing on that queue.

(Obviously incomplete patch, which is why I also intentionally 
whitespace-broke it).
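
For anyone who wants to poke at the idea outside the kernel, here is a toy
userspace model of it.  Only the names is_queue_noflush() and
set_queue_noflush() come from the patch fragment above; struct toy_queue,
toy_device_flush() and toy_issue_flush() are invented for this sketch and are
not block layer code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* toy stand-in for struct request_queue; only the cached bit matters */
struct toy_queue {
	bool noflush;			/* set once the device rejects a flush */
	bool dev_supports_flush;	/* what the pretend device can do */
	int flushes_sent;		/* flushes that reached the device */
};

static bool is_queue_noflush(const struct toy_queue *q)
{
	return q->noflush;
}

static void set_queue_noflush(struct toy_queue *q)
{
	q->noflush = true;
}

/* pretend device: 0 on success, -EOPNOTSUPP if it has no flush command */
static int toy_device_flush(struct toy_queue *q)
{
	q->flushes_sent++;
	return q->dev_supports_flush ? 0 : -EOPNOTSUPP;
}

/* same shape as the patch: remember "not supported" and report success */
static int toy_issue_flush(struct toy_queue *q)
{
	int ret;

	assert(q);
	if (is_queue_noflush(q))
		return 0;

	ret = toy_device_flush(q);
	if (ret == -EOPNOTSUPP) {
		set_queue_noflush(q);
		ret = 0;
	}
	return ret;
}
```

On a queue whose device lacks flush support, the first toy_issue_flush()
call reaches the device once and sets the bit; every later call returns 0
without touching the device again.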

> Sure, we could cache this value, but it's pretty
> pointless since the filesystem will stop sending barriers in this case.

Well no, it won't. Or rather, it will have to have such a stupid 
per-filesystem flag, for no good reason.

> For blkdev_issue_flush() it may not be very interesting, since there's
> not much we can do about that. Just seems like very bad style to NOT
> return an error in such a case. You can assume that ordering is fine,
> but it definitely won't be in all cases (e.g. devices that have write back
> caching on by default and don't support flush).

So?

The thing is, you can't _do_ anything about it. So what's the point in 
returning an error? The caller cannot possibly care - because there is 
nothing the caller can really do.

Sure, the device may or may not re-order things, but since the caller 
can't know, and can't really do a thing about it _anyway_, you're just 
better off not even confusing anybody.

		Linus


* Re: Linux 2.6.29
  2009-03-30 17:57                                                                                                           ` Chris Mason
@ 2009-03-30 18:39                                                                                                             ` Mark Lord
  2009-03-30 18:52                                                                                                               ` Chris Mason
  2009-03-30 18:54                                                                                                             ` Pasi Kärkkäinen
  1 sibling, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-30 18:39 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Ric Wheeler, Andreas T.Auer, Alan Cox,
	Theodore Tso, Stefan Richter, Jeff Garzik, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Chris Mason wrote:
>
> I had some fun trying things with this, and I've been able to reliably
> trigger stalls in write cache of ~60 seconds on my seagate 500GB sata
> drive.  The worst I saw was 214 seconds.
..

I'd be more interested in how you managed that (above),
than the quite different test you describe below.

Yes, different, I think.  The test below just times how long a single
chunk of data might stay in-drive cache under constant load,
rather than how long it takes to flush the drive cache on command.

Right?

Still, useful for other stuff.

> It took a little experimentation, and I had to switch to the noop
> scheduler (no idea why).  
> 
> Also, I had to watch vmstat closely.  When the test first started,
> vmstat was reporting 500kb/s or so write throughput.  After the test ran
> for a few minutes, vmstat jumped up to 8MB/s.
> 
> My guess is that the drive has some internal threshold for when it
> decides to only write in cache.  The switch to 8MB/s is when it switched
> to cache only goodness.  Or perhaps the attached program is buggy and
> I'll end up looking silly...it was some quick coding.
> 
> The test forks two procs.  One proc does 4k writes to the first 26MB of
> the test file (/dev/sdb for me).  These writes are O_DIRECT, and use a
> block size of 4k.
> 
> The idea is that we fill the cache with work that is very beneficial to
> keep in cache, but that the drive will tend to flush out because it is
> filling up tracks.
> 
> The second proc O_DIRECT writes to two adjacent sectors far away from
> the hot writes from the first proc, and it puts in a timestamp from just
> before the write.  Every second or so, this timestamp is printed to
> stderr.  The drive will want to keep these two sectors in cache because
> we are constantly overwriting them.
> 
> (It's worth mentioning this is a destructive test.  Running it
> on /dev/sdb will overwrite the first 64MB of the drive!!!!)
> 
> Sample output:
> 
> # ./wb-latency /dev/sdb
> Found tv 1238434622.461527
> starting hot writes run
> starting tester run
> current time 1238435045.529751
> current time 1238435046.531250
> ...
> current time 1238435063.772456
> current time 1238435064.788639
> current time 1238435065.814101
> current time 1238435066.847704
> 
> Right here, I pull the power cord.  The box comes back up, and I run:
> 
> # ./wb-latency -c /dev/sdb
> Found tv 1238435067.347829
> 
> When -c is passed, it just reads the timestamp out of the timestamp
> block and exits.  You compare this value with the value printed just
> before you pulled the plug.
> 
> For the run here, the two values are within .5s of each other.  The
> tester only prints the time every one second, so anything that close is
> very good.  I had pulled the plug before the drive got into that fast
> 8MB/s mode, so the drive was doing a pretty good job of fairly servicing
> the cache.
> 
> My drive has a cache of 32MB.  Smaller caches probably need a smaller
> hot zone.
> 
> -chris
> 
> 



* Re: Linux 2.6.29
  2009-03-30 18:39                                                                                                             ` Mark Lord
@ 2009-03-30 18:52                                                                                                               ` Chris Mason
  2009-03-30 20:19                                                                                                                 ` Mark Lord
  0 siblings, 1 reply; 664+ messages in thread
From: Chris Mason @ 2009-03-30 18:52 UTC (permalink / raw)
  To: Mark Lord
  Cc: Linus Torvalds, Ric Wheeler, Andreas T.Auer, Alan Cox,
	Theodore Tso, Stefan Richter, Jeff Garzik, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Mon, 2009-03-30 at 14:39 -0400, Mark Lord wrote:
> Chris Mason wrote:
> >
> > I had some fun trying things with this, and I've been able to reliably
> > trigger stalls in write cache of ~60 seconds on my seagate 500GB sata
> > drive.  The worst I saw was 214 seconds.
> ..
> 
> I'd be more interested in how you managed that (above),
> than the quite different test you describe below.
> 
> Yes, different, I think.  The test below just times how long a single
> chunk of data might stay in-drive cache under constant load,
> rather than how long it takes to flush the drive cache on command.
> 
> Right?
> 
> Still, useful for other stuff.
> 

That's right, it is testing for starvation in a single sector, not for
how long the cache flush actually takes.  But, your remark from higher
up in the thread was this:

        > 
        > Anything in the drive's write cache very probably made 
        > it to the media within a second or two of arriving there.
        >
        
Sorry if I misread things.  But the goal is just to show that it really
does matter if we use a writeback cache with or without barriers.  The
test has two datasets:

1) An area that is constantly overwritten sequentially
2) A single sector that stores a critical bit of data.

#1 is the filesystem log, #2 is the filesystem super.  This isn't a
specialized workload ;)

-chris




* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 18:27                                                         ` Linus Torvalds
@ 2009-03-30 18:54                                                           ` Jens Axboe
  2009-03-30 19:16                                                             ` Jeff Garzik
  2009-03-30 19:45                                                             ` Linus Torvalds
  0 siblings, 2 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-30 18:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj

On Mon, Mar 30 2009, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Mar 2009, Jens Axboe wrote:
> > 
> > The problem is that we may not know upfront, so it sort-of has to be
> > this trial approach where the first barrier issued will notice and fail
> > with -EOPNOTSUPP.
> 
> Well, absolutely. Except I don't think you should use ENOTSUPP, you should 
> just set a bit in the "struct request_queue", and then return 0.
> 
> IOW, something like this
> 
> 	--- a/block/blk-barrier.c
> 	+++ b/block/blk-barrier.c
> 	@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
> 	 	if (!q)
> 	 		return -ENXIO;
> 	 
> 	+	if (is_queue_noflush(q))
> 	+		return 0;
> 	+
> 	 	bio = bio_alloc(GFP_KERNEL, 0);
> 	 	if (!bio)
> 	 		return -ENOMEM;
> 	@@ -339,7 +342,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
> 	 
> 	 	ret = 0;
> 	 	if (bio_flagged(bio, BIO_EOPNOTSUPP))
> 	-		ret = -EOPNOTSUPP;
> 	+		set_queue_noflush(q);
> 	 	else if (!bio_flagged(bio, BIO_UPTODATE))
> 	 		ret = -EIO;
> 	 
> 
> which just returns 0 if we don't support flushing on that queue.
> 
> (Obviously incomplete patch, which is why I also intentionally 
> whitespace-broke it).
> 
> > Sure, we could cache this value, but it's pretty
> > pointless since the filesystem will stop sending barriers in this case.
> 
> Well no, it won't. Or rather, it will have to have such a stupid 
> per-filesystem flag, for no good reason.

Sorry, I just don't see much point to doing it this way instead. So now
the fs will have to check a queue bit after it has issued the flush, how
is that any better than having the 'error' returned directly?

> > For blkdev_issue_flush() it may not be very interesting, since there's
> > not much we can do about that. Just seems like very bad style to NOT
> > return an error in such a case. You can assume that ordering is fine,
> > but it definitely won't be in all cases (e.g. devices that have write back
> > caching on by default and don't support flush).
> 
> So?
> 
> The thing is, you can't _do_ anything about it. So what's the point in 
> returning an error? The caller cannot possibly care - because there is 
> nothing the caller can really do.

Not for blkdev_issue_flush(), all they can do is report about the
device. And even that would be a vague "Your data may or may not be
safe, we don't know".

> Sure, the device may or may not re-order things, but since the caller 
> can't know, and can't really do a thing about it _anyway_, you're just 
> better off not even confusing anybody.

I'd call that a pretty reckless approach to data integrity, honestly.
You HAVE to issue an error in this case. Then the user/admin can at least
check up on the device stack in question, and determine whether this is
an issue or not. That goes for both blkdev_issue_flush() and the actual
barrier write. And perhaps the cached value is then of some use, since
you then know when to warn (bit not already set) and you can keep the
warning in blkdev_issue_flush() instead of putting it in every call
site.
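
Roughly what I have in mind, as a toy userspace sketch rather than block
layer code.  struct toy_queue and toy_issue_flush_warn() are names invented
for the illustration, and the message text is only an example:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* toy stand-in for the queue; the cached bit only gates the warning */
struct toy_queue {
	bool noflush_warned;		/* have we complained about this one? */
	bool dev_supports_flush;	/* what the pretend device can do */
	int warnings;			/* number of warnings printed */
};

/*
 * The error is always passed back to the caller; the cached bit just
 * keeps the log from filling up and keeps the warning in one place.
 */
static int toy_issue_flush_warn(struct toy_queue *q, const char *name)
{
	assert(q && name);

	if (q->dev_supports_flush)
		return 0;

	if (!q->noflush_warned) {
		q->noflush_warned = true;
		q->warnings++;
		fprintf(stderr,
			"%s: cache flush not supported, data may not be safe\n",
			name);
	}
	return -EOPNOTSUPP;
}
```

Every call on an unsupported device still returns -EOPNOTSUPP, but the
warning is printed once per queue, so call sites need no message of their
own.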

-- 
Jens Axboe



* Re: Linux 2.6.29
  2009-03-30 17:57                                                                                                           ` Chris Mason
  2009-03-30 18:39                                                                                                             ` Mark Lord
@ 2009-03-30 18:54                                                                                                             ` Pasi Kärkkäinen
  1 sibling, 0 replies; 664+ messages in thread
From: Pasi Kärkkäinen @ 2009-03-30 18:54 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Mark Lord, Ric Wheeler, Andreas T.Auer, Alan Cox,
	Theodore Tso, Stefan Richter, Jeff Garzik, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Mon, Mar 30, 2009 at 01:57:12PM -0400, Chris Mason wrote:
> On Mon, 2009-03-30 at 09:58 -0700, Linus Torvalds wrote:
> > 
> > On Mon, 30 Mar 2009, Mark Lord wrote:
> > >
> > > I spent an entire day recently, trying to see if I could significantly fill
> > > up the 32MB cache on a 750GB Hitachi SATA drive here.
> > > 
> > > With deliberate/random write patterns, big and small, near and far,
> > > I could not fill the drive with anything approaching a full second
> > > of latent write-cache flush time.
> > > 
> > > Not even close.  Which is a pity, because I really wanted to do some testing
> > > related to a deep write cache.  But it just wouldn't happen.
> > > 
> > > I tried this again on a 16MB cache of a Seagate drive, no difference.
> > > 
> > > Bummer.  :)
> > 
> > Try it with laptop drives. You might get to a second, or at least hundreds 
> > of ms (not counting the spinup delay if it went to sleep, obviously). You 
> > probably tested desktop drives (that 750GB Hitachi one is not a low end 
> > one, and I assume the Seagate one isn't either).
> 
> I had some fun trying things with this, and I've been able to reliably
> trigger stalls in write cache of ~60 seconds on my seagate 500GB sata
> drive.  The worst I saw was 214 seconds.
> 
> It took a little experimentation, and I had to switch to the noop
> scheduler (no idea why).  
> 

I remember cfq having a bug (or a feature?) that prevents queue depths
deeper than 1.. so with noop you get more ios to the queue. 

-- Pasi


* Re: Linux 2.6.29
  2009-03-30  7:13                                                                                 ` Andreas T.Auer
  2009-03-30  9:05                                                                                   ` Alan Cox
@ 2009-03-30 19:02                                                                                   ` Bill Davidsen
  2009-04-01  1:19                                                                                     ` david
  1 sibling, 1 reply; 664+ messages in thread
From: Bill Davidsen @ 2009-03-30 19:02 UTC (permalink / raw)
  To: linux-kernel

Andreas T.Auer wrote:
> On 30.03.2009 02:39 Theodore Tso wrote:
>> All I can do is apologize to all other filesystem developers profusely
>> for ext3's data=ordered semantics; at this point, I very much regret
>> that we made data=ordered the default for ext3.  But the application
>> writers vastly outnumber us, and realistically we're not going to be
>> able to easily roll back eight years of application writers being
>> trained that fsync() is not necessary, and actually is detrimental for
>> ext3.

> And still I don't know any reason, why it makes sense to write the
> metadata to non-existing data immediately instead of delaying that, too.
> 
Here I have the same question, I don't expect or demand that anything be done in 
a particular order unless I force it so, and I expect there to be some corner 
case where the data is written and the metadata doesn't reflect that in the 
event of a failure, but I can't see that it is ever a good idea to have the 
metadata reflect the future and describe what things will look like if 
everything goes as planned. I have had enough of that BS from financial planners 
and politicians, metadata shouldn't try to predict the future just to save a ms 
here or there. It's also necessary to have the metadata match reality after 
fsync(), of course, or even the well behaved applications mentioned in this 
thread haven't a hope of staying consistent.

Feel free to clarify why clairvoyant metadata is ever a good thing...

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



* range-based cache flushing (was Re: Linux 2.6.29)
  2009-03-25 21:22                                   ` James Bottomley
  2009-03-26  8:59                                     ` Jens Axboe
@ 2009-03-30 19:05                                     ` Jeff Garzik
  2009-04-01  0:14                                       ` James Bottomley
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 19:05 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

James Bottomley wrote:
> On Wed, 2009-03-25 at 16:25 -0400, Ric Wheeler wrote:
>> Jeff Garzik wrote:
>>> Ric Wheeler wrote:
>>>> And, as I am sure that you do know, to add insult to injury,
>>>> FLUSH_CACHE is per device (not file system).

>>>> When you issue an fsync() on a disk with multiple partitions, you 
>>>> will flush the data for all of its partitions from the write cache....
>>> SCSI's SYNCHRONIZE CACHE command already accepts an (LBA, length) 
>>> pair.  We could make use of that.

>>> And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
>>> demonstrate clear benefit.

>> How well supported is this in SCSI?  Can we try it out with a commodity 
>> SAS drive?

> What do you mean by well supported?  The way the SCSI standard is
> written, a device can do a complete cache flush when a range flush is
> requested and still be fully standards compliant.  There's no easy way
> to tell if it does a complete cache flush every time other than by
> taking the firmware apart (or asking the manufacturer).

Quite true, though wondering aloud...

How difficult would it be to pass the "lower-bound" LBA to SYNCHRONIZE 
CACHE, where "lower bound" is defined as the lowest sector in the range 
of sectors to be flushed?

That seems like a reasonable optimization -- it gives the drive an easy 
way to skip sync'ing sectors lower than the lower-bound LBA, if it is 
capable.  Otherwise, a standards-compliant firmware will behave as you 
describe, and do what our code currently expects today -- a full cache 
flush.

This seems like a good way to speed up cache flush [on SCSI], while also 
perhaps experimenting with a more fine-grained way to pass down write 
barriers to the device.

Not a high priority thing overall, but OTOH, consider the case of 
placing your journal at the end of the disk.  You could then issue a 
cache flush with a non-zero starting offset:

	SYNCHRONIZE CACHE (max sectors - JOURNAL_SIZE, ~0)

That should be trivial even for dumb disk firmwares to optimize.
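
To show what would go on the wire, here is a sketch that fills in a 10-byte
SYNCHRONIZE CACHE (10) CDB with a starting LBA.  build_sync_cache10 is a
made-up helper, not an existing kernel function, and the field layout is
written from my reading of SBC (opcode 0x35, big-endian LBA in bytes 2-5,
block count in bytes 7-8), so check it against the spec before relying on it:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SYNCHRONIZE_CACHE_10	0x35	/* SBC opcode */

/*
 * Hypothetical helper: build a SYNCHRONIZE CACHE (10) CDB.
 * A block count of 0 is taken to mean "from lba to the end of the
 * medium", which is the lower-bound case described above.
 */
static void build_sync_cache10(uint8_t *cdb, uint32_t lba, uint16_t nr_blocks)
{
	assert(cdb);
	memset(cdb, 0, 10);

	cdb[0] = SYNCHRONIZE_CACHE_10;
	/* byte 1 (IMMED etc.), byte 6 (group) and byte 9 (control) stay 0 */
	cdb[2] = (uint8_t)(lba >> 24);		/* LBA, big-endian */
	cdb[3] = (uint8_t)(lba >> 16);
	cdb[4] = (uint8_t)(lba >> 8);
	cdb[5] = (uint8_t)lba;
	cdb[7] = (uint8_t)(nr_blocks >> 8);	/* block count, big-endian */
	cdb[8] = (uint8_t)nr_blocks;
}
```

For the journal-at-end-of-disk case, lba would be (max sectors -
JOURNAL_SIZE) and nr_blocks 0.  A firmware that ignores the range still does
a full flush, so correctness does not depend on the drive honoring it.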

	Jeff





* Re: Linux 2.6.29
  2009-03-30 17:51                                                                                                 ` Linus Torvalds
  2009-03-30 18:15                                                                                                   ` Ric Wheeler
@ 2009-03-30 19:08                                                                                                   ` Eric Sandeen
  2009-03-30 19:22                                                                                                   ` Rik van Riel
  2 siblings, 0 replies; 664+ messages in thread
From: Eric Sandeen @ 2009-03-30 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Mon, 30 Mar 2009, Ric Wheeler wrote:
>>> But turn that around, and say: if you don't have redundant disks, then
>>> pretty much by definition those drive flushes won't be guaranteeing your
>>> data _anyway_, so why pay the price?
>> They do in fact provide that promise for the extremely common case of power
>> outage and as such, can be used to build reliable storage if you need to.
> 
> No they really effectively don't. Not if the end result is "oops, the 
> whole track is now unreadable" (regardless of whether it happened due to a 
> write during power-out or during some entirely unrelated disk error). Your 
> "flush" didn't result in a stable filesystem at all, it just resulted in a 
> dead one.
> 
> That's my point. Disks simply aren't that reliable. Anything you do with 
> flushing and ordering won't make them magically not have errors any more.

But this is apples and oranges isn't it?

All of the effort that goes into metadata journalling in ext3, ext4,
xfs, reiserfs, jfs ... is to save us from the fsck time on restart, and
ensure a consistent filesystem framework (metadata, that is, in
general), after an unclean shutdown.  That could be due to a system
crash or a power outage.  This is much more common in my personal
experience than a drive failure.

That journalling requires ordering guarantees, and with large drive
write caches, and no ordering, it's not hard for it to go south to the
point where things *do* get corrupted when you lose power or the drive
resets in the middle of basically random write cache destaging.  See
Chris Mason's tests from a year or so ago, proving that ext3 is quite
vulnerable to this - it likely explains some of the random htree
corruption that occasionally gets reported to us.

And yes, sometimes drives die, and then you are really screwed, but
that's orthogonal to all of the above, I think.

-Eric


* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 17:59                                                           ` Jens Axboe
@ 2009-03-30 19:09                                                             ` Jeff Garzik
  2009-03-30 20:56                                                               ` Bartlomiej Zolnierkiewicz
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 19:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bartlomiej Zolnierkiewicz, Fernando Luis Vázquez Cao,
	Christoph Hellwig, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

Jens Axboe wrote:
> On Mon, Mar 30 2009, Jeff Garzik wrote:
>> Jens Axboe wrote:
>>> On Mon, Mar 30 2009, Bartlomiej Zolnierkiewicz wrote:
>>>> On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
>>>>> Add a sysfs knob to disable storage device writeback cache flushes.
>>>> The horde of casual desktop users (with me included) would probably prefer
>>>> having two settings -- one for filesystem barriers and one for fsync().
>>>>
>>>> IOW I prefer higher performance at the cost of risking losing few last
>>>> seconds/minutes of work in case of crash / powerfailure but I would still
>>>> like to have the filesystem in the consistent state after such accident.
>>> The knob is meant to control whether we really need to send a flush to
>>> the device or not, so it's an orthogonal issue to what you are talking
>>> about. For battery backed caches, we never need to flush. This knob is
>>> useful IFF we have devices with write back caches that STILL do a cache
>>> flush.
>> How do installers and/or kernels detect a battery-backed cache that does
>> not need flush?
> 
> They obviously can't, otherwise it would not be an issue at all. And
> whether it's an issue is up for debate, until someone can point at such
> a device. You could add a white/blacklist.

Sorry, I guess I misinterpreted your dual "IFF" statement :)

I completely agree that the suggested knob, for disabling cache flush 
for these battery-backed devices, is at the present time addressing an 
entirely theoretical argument AFAICS.

	Jeff




* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 18:54                                                           ` Jens Axboe
@ 2009-03-30 19:16                                                             ` Jeff Garzik
  2009-03-30 19:24                                                               ` Chris Mason
  2009-03-30 19:59                                                               ` Linus Torvalds
  2009-03-30 19:45                                                             ` Linus Torvalds
  1 sibling, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 19:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Fernando Luis Vázquez Cao,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

Jens Axboe wrote:
> On Mon, Mar 30 2009, Linus Torvalds wrote:
>>
>> On Mon, 30 Mar 2009, Jens Axboe wrote:
>>> The problem is that we may not know upfront, so it sort-of has to be
>>> this trial approach where the first barrier issued will notice and fail
>>> with -EOPNOTSUPP.
>> Well, absolutely. Except I don't think you should use ENOTSUPP, you should 
>> just set a bit in the "struct request_queue", and then return 0.
>>
>> IOW, something like this
>>
>> 	--- a/block/blk-barrier.c
>> 	+++ b/block/blk-barrier.c
>> 	@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
>> 	 	if (!q)
>> 	 		return -ENXIO;
>> 	 
>> 	+	if (is_queue_noflush(q))
>> 	+		return 0;
>> 	+
>> 	 	bio = bio_alloc(GFP_KERNEL, 0);
>> 	 	if (!bio)
>> 	 		return -ENOMEM;
>> 	@@ -339,7 +342,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
>> 	 
>> 	 	ret = 0;
>> 	 	if (bio_flagged(bio, BIO_EOPNOTSUPP))
>> 	-		ret = -EOPNOTSUPP;
>> 	+		set_queue_noflush(q);
>> 	 	else if (!bio_flagged(bio, BIO_UPTODATE))
>> 	 		ret = -EIO;
>> 	 
>>
>> which just returns 0 if we don't support flushing on that queue.
>>
>> (Obviously incomplete patch, which is why I also intentionally 
>> whitespace-broke it).
>>
>>> Sure, we could cache this value, but it's pretty
>>> pointless since the filesystem will stop sending barriers in this case.
>> Well no, it won't. Or rather, it will have to have such a stupid 
>> per-filesystem flag, for no good reason.
> 
> Sorry, I just don't see much point to doing it this way instead. So now
> the fs will have to check a queue bit after it has issued the flush, how
> is that any better than having the 'error' returned directly?

AFAICS, the aim is simply to return zero rather than EOPNOTSUPP, for the 
not-supported case, rather than burdening all callers with such checks.

Which is quite reasonable for Fernando's patch -- the direct call fsync 
case.

But that leaves open the possibility that some people really do want the 
EOPNOTSUPP return value, I guess?  Do existing callers need that?


>>> For blkdev_issue_flush() it may not be very interesting, since there's
>>> not much we can do about that. Just seems like very bad style to NOT
>>> return an error in such a case. You can assume that ordering is fine,
>>> but it definitely won't be in all cases (eg devices that have write back
>>> caching on by default and don't support flush).
>> So?
>>
>> The thing is, you can't _do_ anything about it. So what's the point in 
>> returning an error? The caller cannot possibly care - because there is 
>> nothing the caller can really do.
> 
> Not for blkdev_issue_flush(), all they can do is report about the
> device. And even that would be a vague "Your data may or may not be
> safe, we don't know".
> 
>> Sure, the device may or may not re-order things, but since the caller 
>> can't know, and can't really do a thing about it _anyway_, you're just 
>> better off not even confusing anybody.
> 
> I'd call that a pretty reckless approach to data integrity, honestly.
> You HAVE to issue an error in this case. Then the user/admin can at least
> check up on the device stack in question, and determine whether this is
> an issue or not. That goes for both blkdev_issue_flush() and the actual
> barrier write. And perhaps the cached value is then of some use, since
> you then know when to warn (bit not already set) and you can keep the
> warning in blkdev_issue_flush() instead of putting it in every call
> site.

Indeed -- if the drive tells us it failed the cache flush, it seems 
self-evident that we should be passing that failure back to userspace 
where possible.

And as the patches show, it is definitely possible to return a FLUSH 
CACHE error back to an fsync(2) caller [though, yes, I certainly 
recognize fsync is not the only generator of these requests].
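
From the userspace side, a caller that wants to see such a failure has to
check the return value of fsync(2) rather than assume the data is stable.
A minimal sketch, assuming a POSIX system; write_all() and save_file() are
hypothetical helper names, not anything from the patches:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write all `len` bytes, retrying on short writes.  0 or a negative errno. */
int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Create `path`, write the buffer and force it to stable storage.
 * Any error from write, fsync or close (for example -EIO from a failed
 * flush) is handed back to the caller instead of being dropped.
 */
int save_file(const char *path, const void *buf, size_t len)
{
	int fd, ret;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -errno;

	ret = write_all(fd, buf, len);
	if (ret == 0 && fsync(fd) < 0)
		ret = -errno;
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

Whether a given kernel and driver stack actually reports a failed cache
flush through fsync() is exactly what is being debated here; the sketch
only shows where the error would surface if it does.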

	Jeff





* Re: Linux 2.6.29
  2009-03-30 17:51                                                                                                 ` Linus Torvalds
  2009-03-30 18:15                                                                                                   ` Ric Wheeler
  2009-03-30 19:08                                                                                                   ` Eric Sandeen
@ 2009-03-30 19:22                                                                                                   ` Rik van Riel
  2009-03-30 19:41                                                                                                     ` Jeff Garzik
                                                                                                                       ` (3 more replies)
  2 siblings, 4 replies; 664+ messages in thread
From: Rik van Riel @ 2009-03-30 19:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> On Mon, 30 Mar 2009, Ric Wheeler wrote:

>> Heat is a major killer of spinning drives (as is severe cold). A lot of times,
>> drives that have read errors only (not failed writes) might be fully
>> recoverable if you can re-write that injured sector.
> 
> It's not worked for me, and yes, I've tried.

It's worked here.  It would be nice to have a device mapper module
that can just insert itself between the disk and the higher device
mapper layer and "scrub" the disk, fetching unreadable sectors from
the other RAID copy where required.

> I'm sure it works for some "ok, the write just failed to take, and the CRC 
> was bad" case, but that's apparently not what I've had. I suspect either 
> the track markers got overwritten (and maybe a disk-specific low-level 
> reformat would have helped, but at that point I was not going to trust the 
> drive anyway, so I didn't care), or there was actual major physical damage 
> due to heat and/or head crash and remapping was just not able to cope.

Maybe a stupid question, but aren't tracks so small compared to
the disk head that a physical head crash would take out multiple
tracks at once?  (the last one I experienced here took out a major
part of the disk)

Another case I have seen years ago was me writing data to a disk
while it was still cold (I brought it home, plugged it in and
started using it).  Once the drive came up to temperature, it
could no longer read the tracks it just wrote - maybe the disk
expanded by more than it is willing to seek around for tracks
due to thermal correction?   Low level formatting the drive
made it work perfectly and I kept using it until it was just
too small to be useful :)

> And my point is, IT MAKES SENSE to just do the elevator barrier, _without_ 
> the drive command. 

No argument there.  I have seen NCQ starvation on SATA disks,
with some requests sitting in the drive for seconds, while
the drive was busy handling hundreds of requests/second
elsewhere...

-- 
All rights reversed.


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 19:16                                                             ` Jeff Garzik
@ 2009-03-30 19:24                                                               ` Chris Mason
  2009-03-30 20:09                                                                 ` Andi Kleen
  2009-03-30 19:59                                                               ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Chris Mason @ 2009-03-30 19:24 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jens Axboe, Linus Torvalds, Fernando Luis Vázquez Cao,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, david, tj

On Mon, 2009-03-30 at 15:16 -0400, Jeff Garzik wrote:
> Jens Axboe wrote:
> > On Mon, Mar 30 2009, Linus Torvalds wrote:
> >>
> >> On Mon, 30 Mar 2009, Jens Axboe wrote:
> >>> The problem is that we may not know upfront, so it sort-of has to be
> >>> this trial approach where the first barrier issued will notice and fail
> >>> with -EOPNOTSUPP.
> >> Well, absolutely. Except I don't think you should use ENOTSUPP, you should 
> >> just set a bit in the "struct request_queue", and then return 0.
> >>
> >> IOW, something like this
> >>
> >> 	--- a/block/blk-barrier.c
> >> 	+++ b/block/blk-barrier.c
> >> 	@@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
> >> 	 	if (!q)
> >> 	 		return -ENXIO;
> >> 	 
> >> 	+	if (is_queue_noflush(q))
> >> 	+		return 0;
> >> 	+
> >> 	 	bio = bio_alloc(GFP_KERNEL, 0);
> >> 	 	if (!bio)
> >> 	 		return -ENOMEM;
> >> 	@@ -339,7 +342,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
> >> 	 
> >> 	 	ret = 0;
> >> 	 	if (bio_flagged(bio, BIO_EOPNOTSUPP))
> >> 	-		ret = -EOPNOTSUPP;
> >> 	+		set_queue_noflush(q);
> >> 	 	else if (!bio_flagged(bio, BIO_UPTODATE))
> >> 	 		ret = -EIO;
> >> 	 
> >>
> >> which just returns 0 if we don't support flushing on that queue.
> >>
> >> (Obviously incomplete patch, which is why I also intentionally 
> >> whitespace-broke it).
> >>
> >>> Sure, we could cache this value, but it's pretty
> >>> pointless since the filesystem will stop sending barriers in this case.
> >> Well no, it won't. Or rather, it will have to have such a stupid 
> >> per-filesystem flag, for no good reason.
> > 
> > Sorry, I just don't see much point to doing it this way instead. So now
> > the fs will have to check a queue bit after it has issued the flush, how
> > is that any better than having the 'error' returned directly?
> 
> AFAICS, the aim is simply to return zero rather than EOPNOTSUPP, for the 
> not-supported case, rather than burdening all callers with such checks.
> 
> Which is quite reasonable for Fernando's patch -- the direct call fsync 
> case.
> 
> But that leaves open the possibility that some people really do want the 
> EOPNOTSUPP return value, I guess?  Do existing callers need that?
> 

As far as I know, reiserfs is the only one actively using it to choose
different code.  It moves a single wait_on_buffer when barriers are on,
which I took out once to simplify the code.  Ric saw it in some
benchmark numbers and I put it back in.

Given that it was a long time ago, I don't have a problem with changing
it to work like all the other filesystems.

-chris




* Re: [PATCH 2/2] Make relatime default
  2009-03-30 14:42                                                   ` Andrea Arcangeli
  2009-03-30 14:52                                                     ` Xavier Bestel
@ 2009-03-30 19:26                                                     ` Bill Davidsen
  1 sibling, 0 replies; 664+ messages in thread
From: Bill Davidsen @ 2009-03-30 19:26 UTC (permalink / raw)
  To: linux-kernel

Andrea Arcangeli wrote:
> On Thu, Mar 26, 2009 at 06:48:38PM +0000, Alan Cox wrote:
>> On Thu, 26 Mar 2009 17:53:14 +0000
>> Matthew Garrett <mjg@redhat.com> wrote:
>>
>>> Change the default behaviour of the kernel to use relatime for all
>>> filesystems. This can be overridden with the "strictatime" mount
>>> option.
>> NAK this again
> 
> NAK but because if we change the default it is better to change it to
> the real thing: noatime.
> 
> I think this can be solved in userland but perhaps changing this in
> kernel would be a stronger message that atime is officially
> obsoleted. (and nothing will break, not even mutt users will notice,
> and if they really do it won't be anything more than aesthetical)
> 
This makes the assumption that atime is not used, which may be true on your 
system but isn't on others. I regularly move data between faster and slower 
storage based on atime, and promote reactivated projects to something faster, 
while retiring inactive project data elsewhere. Other admins use it to identify 
unused files which are candidates for backup to offline media or the bit bucket.
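
A rough sketch of the kind of check such a migration script relies on,
assuming POSIX stat(2); the function names and the day-based threshold are
hypothetical. With noatime the st_atime field simply stops advancing, so
every file eventually looks idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* True if `atime` lies more than `days` days before `now`. */
bool atime_older_than(time_t atime, time_t now, int days)
{
	assert(days >= 0);
	return difftime(now, atime) > (double)days * 86400.0;
}

/*
 * Returns 1 if the file has not been accessed for more than `days` days,
 * 0 if it has, and -1 if it cannot be stat()ed.
 */
int file_is_stale(const char *path, int days)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return atime_older_than(st.st_atime, time(NULL), days) ? 1 : 0;
}
```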

Let people who want that behavior specify it for existing filesystems. If you 
want to remove functionality from ext4 or btrfs or some other place where people 
have no existing expectations, I still think it's wrong, but I couldn't say I 
think it might break anything.

I did a patch a few years ago which only updated atime on open and write, and 
that worked about as well as relatime: the inode update on open is cheap because 
the head is already there, and it was only slightly slower than noatime. There 
were no programs which kept files open for days and just read them. The only 
storage hierarchy was "slow and cheap." ;-)

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



* Re: Linux 2.6.29
  2009-03-29 21:07                                                               ` Linus Torvalds
@ 2009-03-30 19:37                                                                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 664+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-30 19:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Xavier Bestel, Matthew Garrett, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Linus Torvalds wrote:
> This particular problem really largely boils down to "average memory 
> capacity has expanded a _lot_ more than harddisk speeds have gone up".
>   

Yes, but clearly lawyers are better at fixing this kind of problem.

    J


* Re: Linux 2.6.29
  2009-03-30 19:22                                                                                                   ` Rik van Riel
@ 2009-03-30 19:41                                                                                                     ` Jeff Garzik
  2009-03-30 20:21                                                                                                       ` Michael Tokarev
  2009-03-30 20:05                                                                                                     ` Linus Torvalds
                                                                                                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 19:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Ric Wheeler, Andreas T.Auer, Alan Cox,
	Theodore Tso, Mark Lord, Stefan Richter, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Rik van Riel wrote:
> Linus Torvalds wrote:
>> And my point is, IT MAKES SENSE to just do the elevator barrier, 
>> _without_ the drive command. 
> 
> No argument there.  I have seen NCQ starvation on SATA disks,
> with some requests sitting in the drive for seconds, while
> the drive was busy handling hundreds of requests/second
> elsewhere...

If certain requests are hanging out in the drive's wbcache longer than 
others, that increases the probability that OS filesystem-required, 
elevator-provided ordering becomes skewed once requests are passed to 
drive firmware.

The sad, sucky fact is that NCQ starvation implies FLUSH CACHE is more 
important than ever, if filesystems want to get ordering correct.




IDEALLY, according to the SATA protocol spec, we could issue up to 32 
NCQ commands to a SATA drive, each marked with the "FUA" bit to force 
the command to hit permanent media before returning.

In theory, this NCQ+FUA mode gives the drive maximum ability to optimize 
parallel in-progress commands, decoupling command completion and command 
issue -- while also giving the OS complete control of ordering by virtue 
of emptying the SATA tagged command queue.

In practice, NCQ+FUA flat out did not work on early drives, and 
performance was way under what you would expect for parallel write-thru 
command execution.  I haven't benchmarked NCQ+FUA in a few years; it 
might be worth revisiting.

	Jeff





* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 18:54                                                           ` Jens Axboe
  2009-03-30 19:16                                                             ` Jeff Garzik
@ 2009-03-30 19:45                                                             ` Linus Torvalds
  2009-03-30 20:17                                                               ` Jens Axboe
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 19:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj



On Mon, 30 Mar 2009, Jens Axboe wrote:
> 
> Sorry, I just don't see much point to doing it this way instead. So now
> the fs will have to check a queue bit after it has issued the flush, how
> is that any better than having the 'error' returned directly?

No.

Now the fs SHOULD NEVER CHECK AT ALL.

Either it did the ordering, or the FS cannot do anything about it. 

That's the point. EOPNOTSUPP is not a useful error message. You can't 
_do_ anything about it.

> > Sure, the device may or may not re-order things, but since the caller 
> > can't know, and can't really do a thing about it _anyway_, you're just 
> > better off not even confusing anybody.
> 
> I'd call that a pretty reckless approach to data integrity, honestly.

It has _nothing_ to do with 'reckless'. It has everything to do with 'you 
can't do anything about it'.

> You HAVE to issue an error in this case.

No. Returning an error just means that now the box is useless. Nobody can 
do anything about it. Not the admin, not the driver writer, not anybody. 

Ok, so a device didn't support flushing. We don't know why, we don't know 
if it needed it, we simply don't know. There's nothing to do. But 
returning an error to user mode is unacceptable, because that will result 
in everything just -failing-. 

And total failure is much worse than "we don't know whether the thing 
serialized".

			Linus


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 19:16                                                             ` Jeff Garzik
  2009-03-30 19:24                                                               ` Chris Mason
@ 2009-03-30 19:59                                                               ` Linus Torvalds
  2009-03-30 20:31                                                                 ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 19:59 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj



On Mon, 30 Mar 2009, Jeff Garzik wrote:
>
> Indeed -- if the drive tells us it failed the cache flush, it seems
> self-evident that we should be passing that failure back to userspace where
> possible.

That's not what EOPNOTSUPP means!

EOPNOTSUPP doesn't mean "the cache flush failed". It just means "I don't 
support cache flushing".

No failure anywhere. See?

Maybe the operation isn't supported because there are no caches? Who the 
hell knows? Nobody. The layer just said "I don't support this". For 
example, maybe it just cannot translate the "flush cache" op into its own 
command set, because the thing doesn't _do_ anything like that.

For a concrete example, look at the "loop" driver. It literally returns 
EOPNOTSUPP if the filesystem doesn't have a "fsync()" thing. Ok, so it 
can't do serialization - does that mean that the caller should fail 
entirely? No. But it means that the caller cannot serialize, so now the 
caller has two choices:

 - not work at all

 - ignore it, and assume that a device without serialization is serialized 
   enough as-is.

Those are the only two choices. The caller knows that it can't flush. What 
would you _suggest_ it do? Just stop, and do nothing at all? I really don't 
think that's a useful or valid approach.

And notice - at NO TIME did anything actually fail. It's just that the 
particular protocol didn't support that empty flush op.

(Also note that block/blk-barrier.c really does an empty barrier command. 
If we were to be talking about a real IO with a real payload and the 
"barrier" bit set, that would be different. But we really aren't.)
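
The distinction can be modelled in a few lines of plain userspace C. This
is only a toy under stated assumptions: struct toy_queue, toy_issue_flush()
and the two fake devices are hypothetical stand-ins, not the block layer's
real structures. "Not supported" is remembered and turned into success,
while a flush that was attempted and actually failed still comes back as
an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device request queue. */
struct toy_queue {
	bool noflush;				/* device said "not supported" once */
	int (*do_flush)(struct toy_queue *q);	/* 0, -EOPNOTSUPP or -EIO */
	int flushes_sent;			/* how many flushes reached the device */
};

int toy_issue_flush(struct toy_queue *q)
{
	int ret;

	assert(q != NULL && q->do_flush != NULL);

	if (q->noflush)
		return 0;	/* known unsupported: nothing to do, not an error */

	q->flushes_sent++;
	ret = q->do_flush(q);
	if (ret == -EOPNOTSUPP) {
		q->noflush = true;	/* remember it and report success */
		return 0;
	}
	return ret;		/* a real failure such as -EIO still propagates */
}

/* Two toy devices: one without a flush op, one whose flush fails. */
int dev_without_flush(struct toy_queue *q)
{
	(void)q;
	return -EOPNOTSUPP;
}

int dev_failing_flush(struct toy_queue *q)
{
	(void)q;
	return -EIO;
}
```

After the first call on the device without a flush op, later calls return
0 without touching the device again, so callers never need a per-filesystem
"barriers don't work here" flag.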


		Linus


* Re: Linux 2.6.29
  2009-03-30 19:22                                                                                                   ` Rik van Riel
  2009-03-30 19:41                                                                                                     ` Jeff Garzik
@ 2009-03-30 20:05                                                                                                     ` Linus Torvalds
  2009-03-31  9:27                                                                                                     ` Neil Brown
  2009-03-31 21:13                                                                                                     ` Alan Cox
  3 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 20:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ric Wheeler, Andreas T.Auer, Alan Cox, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Mon, 30 Mar 2009, Rik van Riel wrote:
> 
> Maybe a stupid question, but aren't tracks so small compared to
> the disk head that a physical head crash would take out multiple
> tracks at once?  (the last one I experienced here took out a major
> part of the disk)

Probably. My experiences (not _that_ many drives, but more than one) have 
certainly been that I've never seen a _single_ read error.

> Another case I have seen years ago was me writing data to a disk
> while it was still cold (I brought it home, plugged it in and
> started using it).  Once the drive came up to temperature, it
> could no longer read the tracks it just wrote - maybe the disk
> expanded by more than it is willing to seek around for tracks
> due to thermal correction?   Low level formatting the drive
> made it work perfectly and I kept using it until it was just
> too small to be useful :)

I've had one drive that just stopped spinning. On power-on, it would make 
these pitiful noises trying to get the platters to move, but not actually 
ever work. If I recall correctly, I got the data off it by letting it just 
cool down, then powering up (successfully) and transferring all the data 
I cared about off the disk. And then replacing the disk.

> > And my point is, IT MAKES SENSE to just do the elevator barrier, _without_
> > the drive command. 
> 
> No argument there.  I have seen NCQ starvation on SATA disks,
> with some requests sitting in the drive for seconds, while
> the drive was busy handling hundreds of requests/second
> elsewhere...

I _thought_ we stopped feeding new requests while the flush was active, so 
if you actually do a flush, that should never actually happen. But I 
didn't check.

		Linus


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 19:24                                                               ` Chris Mason
@ 2009-03-30 20:09                                                                 ` Andi Kleen
  2009-03-30 20:15                                                                   ` Chris Mason
  0 siblings, 1 reply; 664+ messages in thread
From: Andi Kleen @ 2009-03-30 20:09 UTC (permalink / raw)
  To: Chris Mason
  Cc: Jeff Garzik, Jens Axboe, Linus Torvalds,
	Fernando Luis Vázquez Cao, Christoph Hellwig, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, david, tj

Chris Mason <chris.mason@oracle.com> writes:
>
> As far as I know, reiserfs is the only one actively using it to choose
> different code.  It moves a single wait_on_buffer when barriers are on,
> which I took out once to simplify the code.  Ric saw it in some
> benchmark numbers and I put it back in.
>
> Given that it was a long time ago, I don't have a problem with changing
> it to work like all the other filesystems.

If it was a win on reiserfs back then, maybe it would be a win
on ext4 or xfs today too?

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 20:09                                                                 ` Andi Kleen
@ 2009-03-30 20:15                                                                   ` Chris Mason
  0 siblings, 0 replies; 664+ messages in thread
From: Chris Mason @ 2009-03-30 20:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jeff Garzik, Jens Axboe, Linus Torvalds,
	Fernando Luis Vázquez Cao, Christoph Hellwig, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, david, tj

On Mon, 2009-03-30 at 22:09 +0200, Andi Kleen wrote:
> Chris Mason <chris.mason@oracle.com> writes:
> >
> > As far as I know, reiserfs is the only one actively using it to choose
> > different code.  It moves a single wait_on_buffer when barriers are on,
> > which I took out once to simplify the code.  Ric saw it in some
> > benchmark numbers and I put it back in.
> >
> > Given that it was a long time ago, I don't have a problem with changing
> > it to work like all the other filesystems.
> 
> When it was a win on reiserfs back then maybe it would be a win
> on ext4 or xfs today too?

It could be, but you get into some larger changes.  The theory behind
the code was that writeback cache is on, so wait_on_buffer isn't really
going to give you a worthwhile error return anyway.  Might as well do
the wait_on_buffer some time later and fix up the commit blocks if it
didn't work out.

We're still arguing about barriers being a good idea all these years
later, and the drives are better at them than they used to be.  So, I'd
rather see less complex code in the filesystems than more.

-chris




* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 19:45                                                             ` Linus Torvalds
@ 2009-03-30 20:17                                                               ` Jens Axboe
  2009-03-30 20:36                                                                 ` Linus Torvalds
  2009-03-30 20:52                                                                 ` Mark Lord
  0 siblings, 2 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-30 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj

On Mon, Mar 30 2009, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Mar 2009, Jens Axboe wrote:
> > 
> > Sorry, I just don't see much point to doing it this way instead. So now
> > the fs will have to check a queue bit after it has issued the flush, how
> > is that any better than having the 'error' returned directly?
> 
> No.
> 
> Now the fs SHOULD NEVER CHECK AT ALL.
> 
> Either it did the ordering, or the FS cannot do anything about it. 
> 
> > That's the point. EOPNOTSUPP is not a useful error message. You can't
> _do_ anything about it.

My point is that some file systems may or may not have different paths
or optimizations depending on whether barriers are enabled and working
or not. Apparently that's just reiserfs and Chris says we can remove it,
so it is probably a moot point.

And that is for the barrier write btw, NOT blkdev_issue_flush(). For the
latter it obviously doesn't matter if you return -EOPNOTSUPP or not, as
long as you warn. And before you yell on, please read on below...

> > > Sure, the device may or may not re-order things, but since the caller 
> > > can't know, and can't really do a thing about it _anyway_, you're just 
> > > better off not even confusing anybody.
> > 
> > I'd call that a pretty reckless approach to data integrity, honestly.
> 
> It has _nothing_ to do with 'reckless'. It has everything to do with 'you 
> can't do anything about it'.

No, but you better damn well inform of such a discovery!

> > You HAVE to issue an error in this case.
> 
> No. Returning an error just means that now the box is useless. Nobody can 
> do anything about it. Not the admin, not the driver writer, not anybody. 

What, that's nonsense. The admin can certainly check whether it's an
issue or not, and he should. That's different from handling it in the
kernel or in the application, but you have to inform about it. I
honestly cannot fathom why you don't think that is important.

> Ok, so a device didn't support flushing. We don't know why, we don't know 
> if it needed it, we simply don't know. There's nothing to do. But 
> returning an error to user mode is unacceptable, because that will result 
> in everything just -failing-. 
> 
> And total failure is much worse than "we don't know whether the thing 
> serialized".

That is not what I meant with returning it to the user. My point was
that you have to notify that the error occurred, which means putting a
printk() (or whatever) in that blkdev_issue_flush(). I guess most of the
miscommunication stems from this, I don't want -EOPNOTSUPP returned to
user space, but I want some notification that tells the admin that this
device doesn't support flushes. And if the file systems use the same
path for barrier or no barriers, then it's perfectly fine to have them
share the very same "flush doesn't work bit" and the same single warning
that we don't know whether ordering is preserved on this device or not.

IOW, what I'm advocating is just a simple:

@@
        if (err == -EOPNOTSUPP) {
+               if (!is_queue_noflush(q)) {
+                       warn();
                        set_queue_noflush(q);
                }
        }

change to the pseudo-patch you posted.
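
For illustration only, the warn-once behaviour of that pseudo-patch can be
modelled in user-space C. This is a sketch under assumptions: struct queue
and note_flush_result() are hypothetical stand-ins rather than block layer
API, and fprintf() stands in for printk().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a request queue; not the kernel's struct. */
struct queue {
	bool noflush;	/* set once the device has rejected a flush */
	int warnings;	/* how many warnings were printed, for illustration */
};

/*
 * Record the result of a flush attempt.  The first -EOPNOTSUPP sets the
 * noflush bit and warns; later ones are silent.  Any other value is
 * ignored by this helper.
 */
void note_flush_result(struct queue *q, int err)
{
	if (err != -EOPNOTSUPP)
		return;
	if (!q->noflush) {
		fprintf(stderr,
			"flush not supported; ordering unknown on this device\n");
		q->warnings++;
		q->noflush = true;
	}
}
```

Calling note_flush_result() twice with -EOPNOTSUPP on the same queue prints
the message once and leaves the bit set, which is the "single warning"
behaviour described above.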

-- 
Jens Axboe



* Re: Linux 2.6.29
  2009-03-30 18:52                                                                                                               ` Chris Mason
@ 2009-03-30 20:19                                                                                                                 ` Mark Lord
  0 siblings, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-30 20:19 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Ric Wheeler, Andreas T.Auer, Alan Cox,
	Theodore Tso, Stefan Richter, Jeff Garzik, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Chris Mason wrote:
> On Mon, 2009-03-30 at 14:39 -0400, Mark Lord wrote:
>> Chris Mason wrote:
>>> I had some fun trying things with this, and I've been able to reliably
>>> trigger stalls in write cache of ~60 seconds on my seagate 500GB sata
>>> drive.  The worst I saw was 214 seconds.
>> ..
>>
>> I'd be more interested in how you managed that (above),
>> than the quite different test you describe below.
>>
>> Yes, different, I think.  The test below just times how long a single
>> chunk of data might stay in-drive cache under constant load,
>> rather than how long it takes to flush the drive cache on command.
>>
>> Right?
>>
>> Still, useful for other stuff.
>>
> 
> That's right, it is testing for starvation in a single sector, not for
> how long the cache flush actually takes.  But, your remark from higher
> up in the thread was this:
> 
>         > 
>         > Anything in the drive's write cache very probably made 
>         > it to the media within a second or two of arriving there.
..

Yeah, but that was in the context of how long the drive takes
to clear out its cache when there's a (brief) break in the action.

Still, it's really good to see hard data on a drive that actually
starves itself for an extended period.  Very handy insight, that!
         
> Sorry if I misread things.  But the goal is just to show that it really
> does matter if we use a writeback cache with or without barriers.  The
> test has two datasets:
> 
> 1) An area that is constantly overwritten sequentially
> 2) A single sector that stores a critical bit of data.
> 
> #1 is the filesystem log, #2 is the filesystem super.  This isn't a
> specialized workload ;)
..

Good points.

I'm thinking of perhaps acquiring an OCZ Vertex SSD.
The 120GB ones apparently have 64MB of RAM inside,
much of which is used to cache data heading to the flash.

I wonder how long it takes to empty out that sucker!

Cheers


* Re: Linux 2.6.29
  2009-03-30 19:41                                                                                                     ` Jeff Garzik
@ 2009-03-30 20:21                                                                                                       ` Michael Tokarev
  2009-03-30 20:26                                                                                                         ` Mark Lord
  2009-03-30 20:34                                                                                                         ` Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: Michael Tokarev @ 2009-03-30 20:21 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Rik van Riel, Linus Torvalds, Ric Wheeler, Andreas T.Auer,
	Alan Cox, Theodore Tso, Mark Lord, Stefan Richter,
	Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Jeff Garzik wrote:
[]
> IDEALLY, according to the SATA protocol spec, we could issue up to 32 
> NCQ commands to a SATA drive, each marked with the "FUA" bit to force 
> the command to hit permanent media before returning.
> 
> In theory, this NCQ+FUA mode gives the drive maximum ability to optimize 
> parallel in-progress commands, decoupling command completion and command 
> issue -- while also giving the OS complete control of ordering by virtue 
> of emptying the SATA tagged command queue.
> 
> In practice, NCQ+FUA flat out did not work on early drives, and 
> performance was way under what you would expect for parallel write-thru 
> command execution.  I haven't benchmarked NCQ+FUA in a few years; it 
> might be worth revisiting.

But are there drives out there that actually support FUA?

The only cases I've seen dmesg DIFFERENT from something like

  sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled,
       doesn't support DPO or FUA
       ^^^^^^^^^^^^^^^^^^^^^^^^^^

is with SOME SCSI drives.  Even most modern SAS drives I've seen
report lack of support for DPO or FUA.  Or at least the kernel
reports that.

In the SATA world, I haven't seen a single case.  Seagate (7200.9..7200.11,
Barracuda ES and ES2), WD (Caviar CE, Caviar Black, Caviar Green,
RE2 GP), Hitachi DeskStar and UltraStar (old and new), some others --
all the same, no DPO or FUA.

/mjt


* Re: Linux 2.6.29
  2009-03-30 20:21                                                                                                       ` Michael Tokarev
@ 2009-03-30 20:26                                                                                                         ` Mark Lord
  2009-03-30 20:29                                                                                                           ` Mark Lord
  2009-03-30 20:35                                                                                                           ` Jeff Garzik
  2009-03-30 20:34                                                                                                         ` Jeff Garzik
  1 sibling, 2 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-30 20:26 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Jeff Garzik, Rik van Riel, Linus Torvalds, Ric Wheeler,
	Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Michael Tokarev wrote:
>
> But are there drives out there that actually support FUA?
..

Most (or all?) current model Hitachi Deskstar drives have it.

Cheers



* Re: Linux 2.6.29
  2009-03-30 20:26                                                                                                         ` Mark Lord
@ 2009-03-30 20:29                                                                                                           ` Mark Lord
  2009-03-30 20:35                                                                                                           ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-30 20:29 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Jeff Garzik, Rik van Riel, Linus Torvalds, Ric Wheeler,
	Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Mark Lord wrote:
> Michael Tokarev wrote:
>>
>> But are there drives out there that actually support FUA?
> ..
> 
> Most (or all?) current model Hitachi Deskstar drives have it.
..

Mmmm.. so does my notebook's WD 250GB drive.


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 19:59                                                               ` Linus Torvalds
@ 2009-03-30 20:31                                                                 ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj

Linus Torvalds wrote:
> On Mon, 30 Mar 2009, Jeff Garzik wrote:
>> Indeed -- if the drive tells us it failed the cache flush, it seems
>> self-evident that we should be passing that failure back to userspace where
>> possible.

> EOPNOTSUPP doesn't mean "the cache flush failed". It just means "I don't 
> support cache flushing".
> 
> No failure anywhere. See?

Hence my statement of

	the aim is simply to return zero rather than EOPNOTSUPP [...]
	which is quite reasonable

I think we are all getting a bit confused whether we are discussing

	(a) EOPNOTSUPP return value,
		or
	(b) _all possible_ blkdev_issue_flush() error return values.

As I read it, you are talking about (a) and Jens responded to (b).  But 
maybe I am wrong.

So I have these observations:

1) fsync(2) should not return EOPNOTSUPP, if the block device does not 
support cache flushing.  This seems to agree with Linus's patch.

2) A Linux filesystem MIGHT care about EOPNOTSUPP return value, as that 
return value does provide information about the future value of cache 
flushes.

3) However, at present NONE of the blkdev_issue_flush() callers use 
EOPNOTSUPP in any way.  In fact, none of the current callers check the 
return value at all.

4) Furthermore, handling lack of cache flush support at the block layer, 
rather than per-filesystem, makes more sense to me.

But I am biased towards storage, so what do I know :)

5) Based on observation #3, the current kernel should be changed to 
return USEFUL blkdev_issue_flush() return values back to userspace. 
Fernando's patches head in this direction, as does my most recent 
file_fsync patch.
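
As a rough sketch of how observations 1 and 5 fit together, the mapping
from a flush result to what fsync(2) reports could look like the following.
The helper name fsync_result_from_flush() is hypothetical and is not taken
from any of the posted patches.

```c
#include <errno.h>

/*
 * Hypothetical helper: translate the result of a block device cache
 * flush into the value fsync(2) would hand back to the caller.
 */
int fsync_result_from_flush(int flush_err)
{
	/* The device has no flush support: that is not a failure. */
	if (flush_err == -EOPNOTSUPP)
		return 0;

	/* Anything else (0, -EIO, ...) is passed through unchanged. */
	return flush_err;
}
```

With this mapping a real failure such as -EIO still reaches userspace,
while lack of flush support never does.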

	Jeff







* Re: Linux 2.6.29
  2009-03-30 20:21                                                                                                       ` Michael Tokarev
  2009-03-30 20:26                                                                                                         ` Mark Lord
@ 2009-03-30 20:34                                                                                                         ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 20:34 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Rik van Riel, Linus Torvalds, Ric Wheeler, Andreas T.Auer,
	Alan Cox, Theodore Tso, Mark Lord, Stefan Richter,
	Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Michael Tokarev wrote:
> In the SATA world, I haven't seen a single case.  Seagate (7200.9..7200.11,
> Barracuda ES and ES2), WD (Caviar CE, Caviar Black, Caviar Green,
> RE2 GP), Hitachi DeskStar and UltraStar (old and new), some others --
> all the same, no DPO or FUA.

If your drive supports NCQ, it is highly likely it supports FUA.

By default, the libata driver _pretends_ your drive does not support FUA.

grep the kernel source for libata_fua and check out the module parameter 
'fua'

	Jeff





* Re: Linux 2.6.29
  2009-03-30 20:26                                                                                                         ` Mark Lord
  2009-03-30 20:29                                                                                                           ` Mark Lord
@ 2009-03-30 20:35                                                                                                           ` Jeff Garzik
  2009-03-30 20:40                                                                                                             ` Mark Lord
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 20:35 UTC (permalink / raw)
  To: Mark Lord
  Cc: Michael Tokarev, Rik van Riel, Linus Torvalds, Ric Wheeler,
	Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Mark Lord wrote:
> Michael Tokarev wrote:
>>
>> But are there drives out there that actually support FUA?
> ..
> 
> Most (or all?) current model Hitachi Deskstar drives have it.

Depends on your source of information:  if you judge from probe 
messages, libata_fua==0 will imply !FUA-support.

	Jeff





* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 20:17                                                               ` Jens Axboe
@ 2009-03-30 20:36                                                                 ` Linus Torvalds
  2009-03-31  2:14                                                                   ` Ric Wheeler
  2009-03-31  6:01                                                                   ` Jens Axboe
  2009-03-30 20:52                                                                 ` Mark Lord
  1 sibling, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-30 20:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj



On Mon, 30 Mar 2009, Jens Axboe wrote:
> > 
> > It has _nothing_ to do with 'reckless'. It has everything to do with 'you 
> > can't do anything about it'.
> 
> No, but you better damn well inform of such a discovery!

Well, if that's the issue, then just add a printk to that 
'blkdev_issue_flush()', and now you have that informational message in 
_one_ place, instead of having each filesystem do it over and
over again.

> > No. Returning an error just means that now the box is useless. Nobody can 
> > do anything about it. Not the admin, not the driver writer, not anybody. 
> 
> What, that's nonsense. The admin can certainly check whether it's an
> issue or not, and he should.

If it's just informational, then again - why should the filesystem care?

Returning an error to the caller is never the right thing to do. The 
caller can't do anything sane about it.

If you argue that the admin wants to know, then sure, make that

                if (bio_flagged(bio, BIO_EOPNOTSUPP))
        -               ret = -EOPNOTSUPP;
        +               set_queue_noflush(q);

"set_queue_noflush()" function print a warning message when it sets the 
bit.

Problem solved.

> That's different from handling it in the kernel or in the application, 
> but you have to inform about it. I honestly cannot fathom why you don't 
> think that is important.

I cannot fathom why you can _possibly_ think that this is something that 
can and must be done something about in the caller. When the caller 
obviously has no real option except to ignore the error _anyway_.

That was always my point. Returning an error is INSANE, because there is no
valid thing that the caller can possibly do.

If you want it logged, fine. But THAT DOES NOT CHANGE ANYTHING. It would 
still be wrong to return the error, since the caller _still_ can't do 
anything about it.

		Linus


* Re: Linux 2.6.29
  2009-03-30 20:35                                                                                                           ` Jeff Garzik
@ 2009-03-30 20:40                                                                                                             ` Mark Lord
  0 siblings, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-30 20:40 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Jeff Garzik, Rik van Riel, Linus Torvalds, Ric Wheeler,
	Andreas T.Auer, Alan Cox, Theodore Tso, Stefan Richter,
	Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Jeff Garzik wrote:
> Mark Lord wrote:
>> Michael Tokarev wrote:
>>>
>>> But are there drives out there that actually support FUA?
>> ..
>>
>> Most (or all?) current model Hitachi Deskstar drives have it.
> 
> Depends on your source of information:  if you judge from probe 
> messages, libata_fua==0 will imply !FUA-support.
..

As your other post points out, lots of drives already support FUA,
but libata deliberately disables it by default (due to the performance
impact, similar to mounting a f/s with -osync).


For the curious, you can use this command to see if your hardware has FUA:

   hdparm -I /dev/sd? | grep FUA

It will show lines like this for the drives that support it:
 
  *    WRITE_{DMA|MULTIPLE}_FUA_EXT

Cheers


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 20:17                                                               ` Jens Axboe
  2009-03-30 20:36                                                                 ` Linus Torvalds
@ 2009-03-30 20:52                                                                 ` Mark Lord
  2009-03-30 20:57                                                                   ` Jeff Garzik
                                                                                     ` (2 more replies)
  1 sibling, 3 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-30 20:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

Jens Axboe wrote:
> On Mon, Mar 30 2009, Linus Torvalds wrote:
>>
>> On Mon, 30 Mar 2009, Jens Axboe wrote:
>>> Sorry, I just don't see much point to doing it this way instead. So now
>>> the fs will have to check a queue bit after it has issued the flush, how
>>> is that any better than having the 'error' returned directly?
>> No.
>>
>> Now the fs SHOULD NEVER CHECK AT ALL.
>>
>> Either it did the ordering, or the FS cannot do anything about it. 
>>
>> That's the point. EOPNOTSUPP is not a useful error message. You can't
>> _do_ anything about it.
> 
> My point is that some file systems may or may not have different paths
> or optimizations depending on whether barriers are enabled and working
> or not. Apparently that's just reiserfs and Chris says we can remove it,
> so it is probably a moot point.
..

XFS appears to have something along those lines.
I believe it tries to disable the drive write caches
if it discovers that it cannot do cache flushes.

I'll check next time my MythTV box boots up.
It has a RAID0 under XFS, and the md raid0 code doesn't
appear to pass the cache flushes to libata for raid0,
so XFS complains and tries to turn off the write caches.

And I have a script to damn well turn them back ON again
after it does so.  Stupid thing tries to override user policy again.

Cheers


* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 19:09                                                             ` Jeff Garzik
@ 2009-03-30 20:56                                                               ` Bartlomiej Zolnierkiewicz
  2009-03-30 22:01                                                                 ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2009-03-30 20:56 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Christoph Hellwig,
	Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

On Monday 30 March 2009, Jeff Garzik wrote:
> Jens Axboe wrote:
> > On Mon, Mar 30 2009, Jeff Garzik wrote:
> >> Jens Axboe wrote:
> >>> On Mon, Mar 30 2009, Bartlomiej Zolnierkiewicz wrote:
> >>>> On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
> >>>>> Add a sysfs knob to disable storage device writeback cache flushes.
> >>>> The horde of casual desktop users (with me included) would probably prefer
> >>>> having two settings -- one for filesystem barriers and one for fsync().
> >>>>
> >>>> IOW I prefer higher performance at the cost of risking losing few last
> >>>> seconds/minutes of work in case of crash / powerfailure but I would still
> >>>> like to have the filesystem in the consistent state after such accident.
> >>> The knob is meant to control whether we really need to send a flush to
> >>> the device or not, so it's an orthogonal issue to what you are talking
> >>> about. For battery backed caches, we never need to flush. This knob is
> >>> useful IFF we have devices with write back caches that STILL do a cache
> >>> flush.
> >> How do installers and/or kernels detect a battery-backed cache that does
> >> not need flush?
> > 
> > They obviously can't, otherwise it would not be an issue at all. And
> > whether it's an issue is up for debate, until someone can point at such
> > a device. You could add a white/blacklist.
> 
> Sorry, I guess I misinterpreted your dual "IFF" statement :)
> 
> I completely agree that the suggested knob, for disabling cache flush 
> for these battery-backed devices, is at the present time addressing an 
> entirely theoretical argument AFAICS.

Guys, please look at the patch in the context of the whole posted patchset,
not only in the context of Linus' current tree.

Patch #4 adds a mandatory cache flush to fsync() (based on Jeff's earlier
submission, I guess) and patch #5 (this patch) adds a knob to disable cache
flushing completely.

If patch #4 is ever going to be applied I want to have an option to disable
mandatory cache flushing on fsync() -- I don't need it and I don't want it
on my desktop (+ I somehow believe I'm not the only one).  OTOH I do need it
and I do want it on my server (+ to be on by default).

Actually, the legacy fsync() syscall is a pretty bad interface in itself because:

	a) it is synchronous

	b) operates only on a single fd

and it encourages some pretty stupid (performance wise) usages like
calling fsync() after every mail fetched.  Adding mandatory cache flush
to it only makes things worse (again looking from performance POV).
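
To make the cost concrete, here is a minimal user-space sketch contrasting
fsync() after every message with a single fsync() at the end of the batch.
The names write_all() and store_messages() are made up for this example,
and the code only counts fsync() calls; it says nothing about what the
device's write cache does underneath.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes.  Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Append count messages to fd.  With sync_each set, fsync() is called
 * after every message; otherwise once after the whole batch.
 * Returns the number of fsync() calls made, or -1 on error.
 */
int store_messages(int fd, const char *const *msgs, size_t count,
		   int sync_each)
{
	int syncs = 0;

	for (size_t i = 0; i < count; i++) {
		if (write_all(fd, msgs[i], strlen(msgs[i])) < 0)
			return -1;
		if (sync_each) {
			if (fsync(fd) < 0)
				return -1;
			syncs++;
		}
	}
	if (!sync_each && count > 0) {
		if (fsync(fd) < 0)
			return -1;
		syncs++;
	}
	return syncs;
}
```

The batched variant leaves earlier messages unsynced until the final
fsync(), so it trades per-message durability for fewer synchronous waits.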

BTW in Linux world we never made any guarantees for fsync() on devices
using write caching:

$ man fsync
...
       If  the  underlying  hard disk has write caching enabled, then the data
       may not really be on  permanent  storage  when  fsync()  /  fdatasync()
       return.
...

aio_fsync() looks a bit better on paper but no filesystem implements
it currently...


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 20:52                                                                 ` Mark Lord
@ 2009-03-30 20:57                                                                   ` Jeff Garzik
  2009-03-31 13:16                                                                   ` Chris Mason
  2009-03-31 15:49                                                                   ` Eric Sandeen
  2 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 20:57 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jens Axboe, Linus Torvalds, Fernando Luis Vázquez Cao,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

Mark Lord wrote:
> Jens Axboe wrote:
>> My point is that some file systems may or may not have different paths
>> or optimizations depending on whether barriers are enabled and working
>> or not. Apparently that's just reiserfs and Chris says we can remove it,
>> so it is probably a moot point.

> XFS appears to have something along those lines.
> I believe it tries to disable the drive write caches
> if it discovers that it cannot do cache flushes.

Perhaps; but speaking specifically about blkdev_issue_flush() --

nobody checks its return value at the present time.

	Jeff





* Re: Linux 2.6.29
  2009-03-27 16:02       ` Linus Torvalds
  2009-03-28  7:50         ` Mike Galbraith
@ 2009-03-30 22:00         ` Hans-Peter Jansen
  2009-03-30 22:07           ` Arjan van de Ven
  2009-04-02 19:01           ` Andreas T.Auer
  1 sibling, 2 replies; 664+ messages in thread
From: Hans-Peter Jansen @ 2009-03-30 22:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mike Galbraith, Geert Uytterhoeven, linux-kernel, arjan

Am Freitag, 27. März 2009 schrieb Linus Torvalds:

> In other words, the main Makefile version is totally useless in
> non-linear development, and is meaningful _only_ at specific release
> times. In between releases, it's essentially a random thing, since
> non-linear development means that versioning simply fundamentally isn't
> some simple monotonic numbering. And this is exactly when
> CONFIG_LOCALVERSION_AUTO is a huge deal.

Well, you guys always see things from a deeply involved kernel developer 
_using git_ POV - which I do understand and accept (unlike hats nobody can 
change his head after all ;-), but there are other approaches to kernel 
source code, e.g. git is also really great for tracking the kernel 
development without any further involvement apart from using the resulting 
trees.

I build kernel rpms from your git tree, and have a bunch of BUILDs lying 
around. Sure, I can always fetch the tarballs or fiddle with git, but why? 
Having a Makefile commit at the start allows one to make sure with the simplest tools,
say "head Makefile", that a locally copied 2.6.29 tree is really a 2.6.29,
and not something moving towards the next release. That's all, nothing 
less, nothing more, it's just a strong hint which blend is in the box.

I always wonder, why Arjan does not intervene for his kerneloops.org 
project, since your approach opens a window of uncertainty during the merge 
window when simply using git as an efficient fetch tool.

Ducks and hides now,
Pete


* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 20:56                                                               ` Bartlomiej Zolnierkiewicz
@ 2009-03-30 22:01                                                                 ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-03-30 22:01 UTC (permalink / raw)
  To: Bartlomiej Zolnierkiewicz
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Christoph Hellwig,
	Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

Bartlomiej Zolnierkiewicz wrote:
> calling fsync() after every mail fetched.  Adding mandatory cache flush
> to it only makes things worse (again looking from performance POV).
> 
> BTW in Linux world we never made any guarantees for fsync() on devices
> using write caching:

Quite true, but I've always thought that was trading away correctness 
for performance...  at a critical juncture where a consistency 
checkpoint was explicitly requested by the app.

My ideal would probably be blkdev cache flushing by default on fsync(2), 
with a block layer "desktop mode" knob to turn it off if you don't want it.

The current alternatives -- mount sync or disable blkdev writeback cache 
-- are far, far slower and punish the entire system just to provide a 
consistency checkpoint for a handful of fsync-needful apps.

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 22:00         ` Hans-Peter Jansen
@ 2009-03-30 22:07           ` Arjan van de Ven
  2009-03-30 10:18             ` Pavel Machek
                               ` (3 more replies)
  2009-04-02 19:01           ` Andreas T.Auer
  1 sibling, 4 replies; 664+ messages in thread
From: Arjan van de Ven @ 2009-03-30 22:07 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: Linus Torvalds, Mike Galbraith, Geert Uytterhoeven, linux-kernel

Hans-Peter Jansen wrote:
> Am Freitag, 27. März 2009 schrieb Linus Torvalds:
> 
> I always wonder, why Arjan does not intervene for his kerneloops.org 
> project, since your approach opens a window of uncertainty during the merge 
> window when simply using git as an efficient fetch tool.

I would *love* it if Linus would, as the first commit, mark his tree as "-git0"
(as per snapshots) or "-rc0". So that I can split the "final" versus
"merge window" oopses.




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-30 14:33                                                         ` Theodore Tso
@ 2009-03-31  1:26                                                           ` Tejun Heo
  2009-03-31  1:58                                                             ` Theodore Tso
  2009-03-31 11:18                                                             ` Jens Axboe
  0 siblings, 2 replies; 664+ messages in thread
From: Tejun Heo @ 2009-03-31  1:26 UTC (permalink / raw)
  To: Theodore Tso, Chris Mason, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Linus Torvalds, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	david, tj

Hello,

Theodore Tso wrote:
> On Mon, Mar 30, 2009 at 10:15:51AM -0400, Chris Mason wrote:
>> I'm not sure we want to stick Fernando with changing how barriers are
>> done in individual filesystems, his patch is just changing the existing
>> call points.
> 
> Well, his patch actually added some calls to block_issue_flush().  But
> yes, it's probably better if he just changes the existing call points,
> and we can have the relevant filesystem maintainers double check to
> make sure that there aren't any new call points which are needed.

How about having something like blk_ensure_cache_flushed() which
issues flush iff there hasn't been any write since the last flush?
It'll be easy to implement and will filter out duplicate flushes in
most cases.
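
A minimal userspace sketch of that filter, reading the proposal as "a
flush is skipped when no write has been seen since the previous flush".
The struct and the model_* names are hypothetical stand-ins, not kernel
code, and a real version would need locking and per-device state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a request queue; not the kernel's struct. */
struct model_queue {
	bool dirty_since_flush;    /* set by every write, cleared by a flush */
	unsigned int flushes_sent; /* flushes that actually reached the device */
};

/* Every write marks the device cache as possibly dirty. */
static void model_write(struct model_queue *q)
{
	assert(q != NULL);
	q->dirty_since_flush = true;
}

/*
 * Model of the proposed blk_ensure_cache_flushed(): issue a flush only
 * when a write has been seen since the previous flush.
 * Returns true if a flush was issued, false if it was filtered out.
 */
static bool model_ensure_cache_flushed(struct model_queue *q)
{
	assert(q != NULL);
	if (!q->dirty_since_flush)
		return false; /* duplicate flush, nothing to do */
	q->flushes_sent++;
	q->dirty_since_flush = false;
	return true;
}
```

Two back-to-back calls to model_ensure_cache_flushed() send one flush;
any write in between makes the second call flush again.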

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-31  1:26                                                           ` Tejun Heo
@ 2009-03-31  1:58                                                             ` Theodore Tso
  2009-03-31  2:14                                                               ` Tejun Heo
  2009-03-31 11:18                                                             ` Jens Axboe
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-03-31  1:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Chris Mason, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Linus Torvalds, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, david

On Tue, Mar 31, 2009 at 10:26:58AM +0900, Tejun Heo wrote:
> 
> How about having something like blk_ensure_cache_flushed() which
> issues flush iff there hasn't been any write since the last flush?
> It'll be easy to implement and will filter out duplicate flushes in
> most cases.
> 

I thought about such a thing, but my concern is that while this might
suppress most unnecessary double flushes, some intervening write from
another process might slip in which doesn't need to be flushed out.
In other words "in most cases" means that "in some cases" we will take
a performance hit thanks to the duplicate flushes.  So this isn't
something we should depend upon, although if we do detect back-to-back
flushes, obviously we should filter them out.

So if we did something like this, it would be good if we had a
debugging option which would detect double flushes, and printk a
warning identifying the call sites of the first and second flushes (by
function name and line number), so that filesystem developers could
detect the double flushes, and work to eliminate them.
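
A rough userspace sketch of such a debugging aid, assuming it is enough
to remember the call site of the previous flush and whether any write
happened since. All names here (flush_tracker, TRACKED_FLUSH) are
made up for illustration, and fprintf() stands in for printk():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model: remembers who issued the last flush. */
struct flush_tracker {
	bool write_since_flush;      /* any write after the last flush? */
	const char *last_func;       /* call site of the last flush */
	int last_line;
	unsigned int double_flushes; /* back-to-back flushes seen so far */
};

static void tracker_write(struct flush_tracker *t)
{
	assert(t != NULL);
	t->write_since_flush = true;
}

/* Returns true when this flush is a back-to-back duplicate. */
static bool tracker_flush(struct flush_tracker *t, const char *func, int line)
{
	bool dup;

	assert(t != NULL);
	dup = t->last_func != NULL && !t->write_since_flush;
	if (dup) {
		t->double_flushes++;
		fprintf(stderr,
			"double flush: first at %s:%d, second at %s:%d\n",
			t->last_func, t->last_line, func, line);
	}
	t->last_func = func;
	t->last_line = line;
	t->write_since_flush = false;
	return dup;
}

/* Call-site wrapper so the caller's function name and line are recorded. */
#define TRACKED_FLUSH(t) tracker_flush((t), __func__, __LINE__)
```

The macro is what captures the function name and line number at each
call site; the warning names both sites so the redundant one can be
removed.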

Does that make sense?

              	   	  	       	    - Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-31  1:58                                                             ` Theodore Tso
@ 2009-03-31  2:14                                                               ` Tejun Heo
  0 siblings, 0 replies; 664+ messages in thread
From: Tejun Heo @ 2009-03-31  2:14 UTC (permalink / raw)
  To: Theodore Tso, Tejun Heo, Chris Mason,
	Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Linus Torvalds, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, david

Hello,

Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 10:26:58AM +0900, Tejun Heo wrote:
>> How about having something like blk_ensure_cache_flushed() which
>> issues flush iff there hasn't been any write since the last flush?
>> It'll be easy to implement and will filter out duplicate flushes in
>> most cases.
> 
> I thought about such a thing, but my concern is that while this might
> suppress most unnecessary double flushes, some intervening write from
> another process might slip in which doesn't need to be flushed out.
> In other words "in most cases" means that "in some cases" we will take
> a performance hit thanks to the duplicate flushes.  So this isn't
> something we should depend upon, although if we do detect back-to-back
> flushes, obviously we should filter them out.

Yeah, well, it all comes down to how most the "most" is.  If all
that's between the first flush and the second one are some code the
cpu has to eat through, I don't think there's a high chance of an IO
going in between unless the IO was already there and gets scheduled
in between (which can be avoided).

The thing is that detecting dup is possible but missing is not.  If
flush is missing in certain corner paths, nobody would know till
somebody reviews the code.  Even when the problem triggers, it would
be rare and obscure enough to avoid proper diagnosis, so I think if
the "most" is most enough, it could be the better way to do it.  But,
then again, I'm not a FS guy, so if such a thing can be guaranteed in
FSes without too much problem, no need to pull such a stunt at the
block layer.

> So if we did something like this, it would be good if we had a
> debugging option which would detect double flushes, and printk a
> warning identifying the call sites of the first and second flushes (by
> function name and line number), so that filesystem developers could
> detect the double flushes, and work to eliminate them.
> 
> Does that make sense?

Yeap, that definitely sounds like a good idea.  I'll put it on my todo
list.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 20:36                                                                 ` Linus Torvalds
@ 2009-03-31  2:14                                                                   ` Ric Wheeler
  2009-03-31  2:47                                                                     ` Linus Torvalds
  2009-03-31  6:01                                                                   ` Jens Axboe
  1 sibling, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-31  2:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

Linus Torvalds wrote:
> On Mon, 30 Mar 2009, Jens Axboe wrote:
>   
>>> It has _nothing_ to do with 'reckless'. It has everything to do with 'you 
>>> can't do anything about it'.
>>>       
>> No, but you better damn well inform of such a discovery!
>>     
>
> Well, if that's the issue, then just add a printk to that 
> 'blkdev_issue_flush()', and now you have that informational message in 
> _one_ place, instead of having each filesystem having to do it over and 
> over again.
>
>   
>>> No. Returning an error just means that now the box is useless. Nobody can 
>>> do anything about it. Not the admin, not the driver writer, not anybody. 
>>>       
>> What, that's nonsense. The admin can certainly check whether it's an
>> issue or not, and he should.
>>     
>
> If it's just informational, then again - why should the filesystem care?
>
> Returning an error to the caller is never the right thing to do. The 
> caller can't do anything sane about it.
>
> If you argue that the admin wants to know, then sure, make that
>
>                 if (bio_flagged(bio, BIO_EOPNOTSUPP))
>         -               ret = -EOPNOTSUPP;
>         +               set_queue_noflush(q);
>
> "set_queue_noflush()" function print a warning message when it sets the 
> bit.
>
> Problem solved.
>
>   
>> That's different from handling it in the kernel or in the application, 
>> but you have to inform about it. I honestly cannot fathom why you don't 
>> think that is important.
>>     
>
> I cannot fathom why you can _possibly_ think that this is something that 
> can and must be done something about in the caller. When the caller 
> obviously has no real option except to ignore the error _anyway_.
>
> That was always my point. Returning an error is INSANE, because there is no 
> valid thing that the caller can possibly do.
>
> If you want it logged, fine. But THAT DOES NOT CHANGE ANYTHING. It would 
> still be wrong to return the error, since the caller _still_ can't do 
> anything about it.
>
> 		Linus
>   

One thing the caller could do is to disable the write cache on the 
device. A second would be to stop using the transactions - skip the 
journal, just go back to ext2 mode or BSD like soft updates.

Basically, it lets the file system know that its data integrity building 
blocks are not really there and allows it (if it cares) to try and 
minimize the chance of data loss.

Ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31  2:14                                                                   ` Ric Wheeler
@ 2009-03-31  2:47                                                                     ` Linus Torvalds
  2009-03-31  6:04                                                                       ` Jens Axboe
  2009-03-31 11:15                                                                       ` Ric Wheeler
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-31  2:47 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj



On Mon, 30 Mar 2009, Ric Wheeler wrote:
> 
> One thing the caller could do is to disable the write cache on the device.

First off, that's not the caller's job. If the sysadmin enabled it, some 
random filesystem shouldn't disable it.

Secondly, this whole insane belief that "write cache" has anything to do 
with "unable to flush" is just bogus. 

> A second would be to stop using the transactions - skip the journal, 
> just go back to ext2 mode or BSD like soft updates.

f*ck me, what's so hard with understanding that EOPNOTSUPP doesn't mean 
"no ordering". It means what it says - the op isn't supported. For all you 
know, ALL WRITES MAY BE TOTALLY ORDERED, but perhaps there is no way to 
make a _single_ write totally atomic (ie the "set barrier on a command 
that actually does IO").

Besides, why the hell do you think the filesystem (again) should do 
something that the admin didn't ask it to do.

If the admin wants the thing to fall back to ext2, then he can ask to 
disable the journal.

> Basically, it lets the file system know that its data integrity building
> blocks are not really there and allows it (if it cares) to try and minimize
> the chance of data loss.

Your whole idiotic "as a filesystem designer I know better than everybody 
else" model where the filesystem is in total control is total crap.

The fact is, it's not the filesystem's job to make that decision. If the 
admin wants to have write caching enabled, the filesystem should get the 
hell out of the way.

What about laptop mode? Do you expect your filesystem to always decide 
that "ok, the user wanted to spin down disks, but I know better"?

What about people who have UPS's and don't worry about that part? They 
want write caching on the disk, and simply don't want to sync? They still 
worry about OS crashing, since they run random -git development kernels?

In short, stop this IDIOTIC notion that you know better. YOU DO NOT KNOW 
BETTER. The filesystem DOES NOT KNOW BETTER. It should damn well not make 
those kinds of decisions, which are simply not the filesystem's to make!

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 20:36                                                                 ` Linus Torvalds
  2009-03-31  2:14                                                                   ` Ric Wheeler
@ 2009-03-31  6:01                                                                   ` Jens Axboe
  1 sibling, 0 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-31  6:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, chris.mason, david, tj

On Mon, Mar 30 2009, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Mar 2009, Jens Axboe wrote:
> > > 
> > > It has _nothing_ to do with 'reckless'. It has everything to do with 'you 
> > > can't do anything about it'.
> > 
> > No, but you better damn well inform of such a discovery!
> 
> Well, if that's the issue, then just add a printk to that 
> 'blkdev_issue_flush()', and now you have that informational message in 
> _one_ place, instead of having each filesystem having to do it over and 
> over again.

Right, that's exactly what I want :-)

> > > No. Returning an error just means that now the box is useless. Nobody can 
> > > do anything about it. Not the admin, not the driver writer, not anybody. 
> > 
> > What, that's nonsense. The admin can certainly check whether it's an
> > issue or not, and he should.
> 
> If it's just informational, then again - why should the filesystem care?
> 
> Returning an error to the caller is never the right thing to do. The 
> caller can't do anything sane about it.
> 
> If you argue that the admin wants to know, then sure, make that
> 
>                 if (bio_flagged(bio, BIO_EOPNOTSUPP))
>         -               ret = -EOPNOTSUPP;
>         +               set_queue_noflush(q);
> 
> "set_queue_noflush()" function print a warning message when it sets the 
> bit.
> 
> Problem solved.
> 
> > That's different from handling it in the kernel or in the application, 
> > but you have to inform about it. I honestly cannot fathom why you don't 
> > think that is important.
> 
> I cannot fathom why you can _possibly_ think that this is something that 
> can and must be done something about in the caller. When the caller 
> obviously has no real option except to ignore the error _anyway_.
> 
> That was always my point. Returning an error is INSANE, because there is no 
> valid thing that the caller can possibly do.
> 
> If you want it logged, fine. But THAT DOES NOT CHANGE ANYTHING. It would 
> still be wrong to return the error, since the caller _still_ can't do 
> anything about it.

I don't want to return -EOPNOTSUPP, I think this thread has become
increasingly confusing. And it's probably largely due to me mixing write
barriers into it, if we stick purely to blkdev_issue_flush(), then
logging a warning and returning 0 is perfectly fine with me. My
objection was to ignoring the "I can't flush" error in the first place,
not returning 0.
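
To make the agreed behaviour concrete, here is a small userspace model:
the first rejected flush marks the queue, prints one warning, and the
caller still sees 0. set_queue_noflush() only appears as a suggestion
in this thread, so the struct and the model_* functions below are
hypothetical, with fprintf() standing in for printk():

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of a queue; the kernel's request_queue differs. */
struct noflush_queue {
	bool supports_flush;   /* what the device can really do */
	bool noflush;          /* set once the first flush is rejected */
	unsigned int warnings; /* how many times we logged */
};

/* Stand-in for the device: rejects flushes it cannot perform. */
static int model_device_flush(const struct noflush_queue *q)
{
	return q->supports_flush ? 0 : -EOPNOTSUPP;
}

/* Record "cannot flush" on the queue and warn exactly once. */
static void model_set_queue_noflush(struct noflush_queue *q)
{
	if (q->noflush)
		return;
	q->noflush = true;
	q->warnings++;
	fprintf(stderr, "queue: cache flush not supported, disabling flushes\n");
}

/* -EOPNOTSUPP is logged and swallowed; other errors still propagate. */
static int model_issue_flush(struct noflush_queue *q)
{
	int ret;

	assert(q != NULL);
	if (q->noflush)
		return 0; /* already known: do not bother the device */
	ret = model_device_flush(q);
	if (ret == -EOPNOTSUPP) {
		model_set_queue_noflush(q);
		ret = 0;
	}
	return ret;
}
```

The message lives in one place, and filesystems calling
model_issue_flush() never see the "not supported" case.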

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31  2:47                                                                     ` Linus Torvalds
@ 2009-03-31  6:04                                                                       ` Jens Axboe
  2009-03-31 11:15                                                                       ` Ric Wheeler
  1 sibling, 0 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-31  6:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

On Mon, Mar 30 2009, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Mar 2009, Ric Wheeler wrote:
> > 
> > One thing the caller could do is to disable the write cache on the device.
> 
> First off, that's not the caller's job. If the sysadmin enabled it, some 
> random filesystem shouldn't disable it.

Completely agree with that, that is why I want the error logged instead
of returned. The write cache MAY be involved, but it may also be
something entirely different. The cache may be perfectly fine and
ordered but just not supporting flush cache because it doesn't need to
(it has battery backing). The important bit is informing the admin of
the situation, then it's up to the admin to look into the storage stack
and determine if this is a real problem or not. There's nothing the
kernel can do about it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 15:07                                                     ` Bartlomiej Zolnierkiewicz
@ 2009-03-31  6:09                                                       ` Fernando Luis Vázquez Cao
  0 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-31  6:09 UTC (permalink / raw)
  To: Bartlomiej Zolnierkiewicz
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

Bartlomiej Zolnierkiewicz wrote:
> On Monday 30 March 2009, Fernando Luis Vázquez Cao wrote:
>> This patch adds a helper function that should be used by filesystems that need
>> to flush the underlying block device on fsync()/fdatasync().
>>
>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>> ---
>>
>> diff -urNp linux-2.6.29-orig/fs/buffer.c linux-2.6.29/fs/buffer.c
>> --- linux-2.6.29-orig/fs/buffer.c	2009-03-24 08:12:14.000000000 +0900
>> +++ linux-2.6.29/fs/buffer.c	2009-03-30 15:27:04.000000000 +0900
>> @@ -165,6 +165,17 @@ void end_buffer_write_sync(struct buffer
>>   	put_bh(bh);
>>   }
>>
>> +/* Issue flush of write caches on the block device */
>> +int block_flush_device(struct block_device *bdev)
> 
> I don't consider this an improvement over using blkdev_issue_flush().

The reason I used a wrapper is that I did not like the semantics provided
by blkdev_issue_flush(). On the one hand, I did not want to pass -EOPNOTSUPP
to filesystems (it is not an error filesystems should care about). On the
other hand it is weird that some filesystems use blkdev_issue_flush() when
they want to emit a barrier. blkdev_issue_flush() happens to be implemented
as an empty (block layer) barrier, but I think that is an implementation
detail filesystems should not need to know about. Indeed I am working on a
patch that implements blkdev_issue_empty_barrier(), so that we can optimize
fsync() flushes and filesystem-originated barriers independently in the block
layer.

Judging from your comments below, it seems we are on the same page regarding
this issue.

Again, thank you for your feedback!

- Fernando

>> +{
>> +	int ret = 0;
>> +
>> +	ret = blkdev_issue_flush(bdev, NULL);
> 
> The problem lies in using NULL for error_sector argument which shows
> a subtle deficiency of the current implementation/usage of barriers
> based on a write cache flushing.
> 
> I intend to document the issue with adding the FIXME to the current
> users of blkdev_issue_flush() so the problem is at least known and not
> forgotten (fixing it would require some work from both block and fs
> sides and unfortunately there wasn't even a willingness to discuss
> possible solutions few years back when the original code was added).
> 
> Thanks,
> Bart


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-30 14:35                                                         ` Jens Axboe
@ 2009-03-31  6:49                                                           ` Fernando Luis Vázquez Cao
  2009-03-31 10:38                                                             ` Jens Axboe
  0 siblings, 1 reply; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-31  6:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

Jens Axboe wrote:
> On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
>> Jens Axboe wrote:
>>> On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
>>>> Add a sysfs knob to disable storage device writeback cache flushes.
>>>>
>>>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>>> ---
>>>>
>>>> diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
>>>> --- linux-2.6.29-orig/block/blk-barrier.c	2009-03-24 08:12:14.000000000 +0900
>>>> +++ linux-2.6.29/block/blk-barrier.c	2009-03-30 17:08:28.000000000 +0900
>>>> @@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
>>>>  	if (!q)
>>>>  		return -ENXIO;
>>>>
>>>> +	if (blk_queue_nowbcflush(q))
>>>> +		return -EOPNOTSUPP;
>>>> +
>>>>  	bio = bio_alloc(GFP_KERNEL, 0);
>>>>  	if (!bio)
>>>>  		return -ENOMEM;
>>>> diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
>>>> --- linux-2.6.29-orig/block/blk-core.c	2009-03-24 08:12:14.000000000 +0900
>>>> +++ linux-2.6.29/block/blk-core.c	2009-03-30 17:08:28.000000000 +0900
>>>> @@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
>>>>  			goto end_io;
>>>>  		}
>>>>  		if (bio_barrier(bio) && bio_has_data(bio) &&
>>>> -		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
>>>> +		    (blk_queue_nowbcflush(q) ||
>>>> +		     q->next_ordered == QUEUE_ORDERED_NONE)) {
>>>>  			err = -EOPNOTSUPP;
>>>>  			goto end_io;
>>>>  		}
>>> This (and the above hunk) should be changed. -EOPNOTSUPP means the
>>> target does not support barriers, that is a different thing to flushes
>>> not being needed. A file system issuing a barrier and getting
>>> -EOPNOTSUPP back will disable barriers, since it now thinks that
>>> ordering cannot be guaranteed.
>> The reason I decided to use -EOPNOTSUPP was that I wanted to keep
>> barriers and device flushes from entering the block layer when
>> they are not needed. I feared that if we pass them down the block
>> stack (knowing in advance they will not be actually submitted to
>> disk) we may end up slowing things down unnecessarily.
> 
> But that's just wrong, you need to make sure that the block layer / io
> scheduler doesn't reorder as well. It's a lot more complex than just the
> device end. So just returning -EOPNOTSUPP and pretending that you need
> not use barriers at the fs end is just wrong.

I should have mentioned that in this patch set I was trying to tackle the
blkdev_issue_flush() case only. As you pointed out, with the code above
requests may get silently reordered across barriers inside the block layer.

The follow-up patch I am working on implements blkdev_issue_empty_barrier(),
which should be used by filesystems that want to emit an empty barrier (as
opposed to just triggering a device flush). Doing this we can optimize
fsync() flushes (block_flush_device()) and filesystem-originated barriers
(blkdev_issue_empty_barrier()) independently in the block layer.

I agree with you that we should pass barriers down in
__generic_make_request, but the optimization above for fsync()-originated
blkdev_issue_flush()'s seems valid to me.

Does it make sense now?

>> As you mentioned, filesystems such as ext3/4 will disable
>> barriers if they get -EOPNOTSUPP when issuing one. I was planning
>> to add a notifier mechanism so that we can notify filesystems that there
>> has been a change in the barrier settings. This might be
>> over-engineering, though. Especially considering that "-o
>> remount,barrier=1" will bring us the barriers back.
> 
> I think that is over-engineering.

Yep, we certainly agree on that point.

>>> A more appropriate change would be to successfully complete a flush
>>> without actually sending it down to the device if blk_queue_nowbcflush()
>>> is true. Then blkdev_issue_flush() would just work as well. It also
>>> needs to take stacking into account, or stacked drivers will have to
>>> propagate the settings up the stack. If you allow simply the barrier to
>>> be passed down, you get that for free.
>> Aren't we risking slowing things down? Does the small optimization above
>> make sense (especially taking the remount trick into account)?
> 
> It's not, I think you are missing the bigger picture.

Sorry for not explaining myself properly. I will add a changelog and better
documentation for the patches.

Thank you for your feedback!

- Fernando

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 19:22                                                                                                   ` Rik van Riel
  2009-03-30 19:41                                                                                                     ` Jeff Garzik
  2009-03-30 20:05                                                                                                     ` Linus Torvalds
@ 2009-03-31  9:27                                                                                                     ` Neil Brown
  2009-03-31 21:13                                                                                                     ` Alan Cox
  3 siblings, 0 replies; 664+ messages in thread
From: Neil Brown @ 2009-03-31  9:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Ric Wheeler, Andreas T.Auer, Alan Cox,
	Theodore Tso, Mark Lord, Stefan Richter, Jeff Garzik,
	Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Monday March 30, riel@redhat.com wrote:
> Linus Torvalds wrote:
> > On Mon, 30 Mar 2009, Ric Wheeler wrote:
> 
> >> Heat is a major killer of spinning drives (as is severe cold). A lot of times,
> >> drives that have read errors only (not failed writes) might be fully
> >> recoverable if you can re-write that injured sector.
> > 
> > It's not worked for me, and yes, I've tried.
> 
> It's worked here.  It would be nice to have a device mapper module
> that can just insert itself between the disk and the higher device
> mapper layer and "scrub" the disk, fetching unreadable sectors from
> the other RAID copy where required.

You want to start using 'md' :-)
With raid1,4,5,6,10, if it gets a read error, it finds the data from
elsewhere and tries to over-write the read error and then read back.
If that all works, then it assumes the drive is still good.
This happens during normal IO and also when you 'scrub' the array, which
e.g. Debian does on the first Sunday of the month by default.
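
The read-error handling described above can be illustrated with a toy
two-way mirror. This is only a sketch of the sequence (read fails,
fetch from the other copy, rewrite, read back); it is not md's code,
and the in-memory mirror_disk type and its "remappable" flag are
assumptions made for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIRROR_SECTORS 8

/* Hypothetical in-memory stand-in for one half of a two-way mirror. */
struct mirror_disk {
	int data[MIRROR_SECTORS];
	bool unreadable[MIRROR_SECTORS]; /* sector returns a read error */
	bool remappable;                 /* a rewrite clears the error */
};

static bool disk_read(const struct mirror_disk *d, size_t s, int *out)
{
	if (d->unreadable[s])
		return false;
	*out = d->data[s];
	return true;
}

static void disk_write(struct mirror_disk *d, size_t s, int v)
{
	d->data[s] = v;
	if (d->remappable)
		d->unreadable[s] = false;
}

/*
 * Read sector s from primary. On a read error, fetch the data from the
 * other copy, rewrite the bad sector and read it back.
 * Returns true if data was obtained; *primary_ok says whether the
 * primary is still considered good afterwards.
 */
static bool mirror_read(struct mirror_disk *primary,
			const struct mirror_disk *other,
			size_t s, int *out, bool *primary_ok)
{
	int check;

	assert(s < MIRROR_SECTORS);
	*primary_ok = true;
	if (disk_read(primary, s, out))
		return true;
	if (!disk_read(other, s, out)) {
		*primary_ok = false;
		return false; /* no good copy anywhere */
	}
	disk_write(primary, s, *out);
	*primary_ok = disk_read(primary, s, &check) && check == *out;
	return true;
}
```

A scrub would amount to calling mirror_read() for every sector rather
than waiting for normal IO to trip over the bad one.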

NeilBrown

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 19:32                                                           ` Theodore Tso
  2009-03-27 20:11                                                             ` Andreas T.Auer
@ 2009-03-31  9:58                                                             ` Neil Brown
  1 sibling, 0 replies; 664+ messages in thread
From: Neil Brown @ 2009-03-31  9:58 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Alan Cox, Linus Torvalds, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Friday March 27, tytso@mit.edu wrote:
> On Fri, Mar 27, 2009 at 07:14:26PM +0000, Alan Cox wrote:
> > > Agreed, we need a middle ground.  We need a transition path that
> > > recognizes that ext3 won't be the dominant filesystem for Linux in
> > > perpetuity, and that ext3's data=ordered semantics will someday no
> > > longer be a major factor in application design.  fbarrier() semantics
> > > might be one approach; there may be others.  It's something we need to
> > > figure out.
> > 
> > Would making close imply fbarrier() rather than fsync() work for this ?
> > That would give people the ordering they want even if they are less
> > careful but wouldn't give the media error cases - which are less
> > interesting.
> 
> The thought that I had was to create a new system call, fbarrier()
> which has the semantics that it will request the filesystem to make
> sure that (at least) changes that have been made to data blocks to date
> should be forced out to disk when the next metadata operation is
> committed.

I'm curious about the exact semantics that you are suggesting.
Do you mean that
 1/ any data block in any file will be forced out before any metadata
    for any file? or
 2/ any data block for 'this' file will be forced out before any
    metadata for any file? or
 3/ any data block for 'this' file will be forced out before any 
    metadata for this file?

I assume the contents of directories are metadata.  If 3 is the case,
do we include the metadata of any directories known to contain this
file?  Recursively?

I think that if we do introduce new semantics, they should be as weak
as possible while still achieving the goal, so that fs designers have
as much freedom as possible.  It should also be as expressive as
possible so that we don't find we want to extend it later.

What would you think of:
   fcntl(fd, F_BEFORE, fd2)

with the semantics that it sets up a transaction dependency between fd
and fd2 and more particularly the operations requested through each
fd.

So if 'fd' is a file, and 'fd2' is the directory holding that file,
then
   fcntl(fd, F_BEFORE, fd2)
   write(fd, stuff)
   renameat(fd2, "file", fd2, "newname")

would ensure that the writes to the file were visible on storage
before the rename.
You could also do
   fd1 = open("afile", O_RDWR);
   fd2 = open("afile", O_RDWR);
   fcntl(fd1, F_BEFORE, fd2);

then use write(fd1) to write journal updates to one part of the
(database) file, and write(fd2) to write in-place updates,
and it would just "do the right thing". (You might want to call
fcntl(fd2, F_BEFORE, fd1) as well ... I haven't quite thought through
the details of that yet).

If you gave AT_FDCWD as the fd2 in the fcntl, then operations on fd1
would be ordered before any namespace operations which did not specify a
particular directory, which would be fairly close to option 2 above.

A minimal implementation could fsync fd1 before allowing any operation
on fd2.  A more sophisticated implementation could record the
dependencies in internal data structures and start writeout of the fd1
changes without actually waiting for them to complete.
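
For comparison, the ordering in the write/rename example can already be
forced from user space, at the cost of a synchronous wait, which is
roughly what the minimal implementation would do internally. A sketch
only: F_BEFORE does not exist, and the helper name write_then_rename()
is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes to `name` inside the directory
 * open as `dirfd`, force the data to storage, and only then rename the
 * file to `newname`. Returns 0 on success, -1 on error (errno is set).
 */
static int write_then_rename(int dirfd, const char *name, const char *newname,
                             const void *buf, size_t len)
{
    int fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* The file data must be stable before the rename can become visible. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    if (renameat(dirfd, name, dirfd, newname) < 0)
        return -1;

    /* Ask for the directory update itself to be written out as well. */
    return fsync(dirfd);
}
```

The difference from the proposed fcntl is that the caller blocks in
fsync() here, whereas a dependency-tracking implementation would only
have to order the writeout.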


Just a thought....

NeilBrown

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-31  6:49                                                           ` Fernando Luis Vázquez Cao
@ 2009-03-31 10:38                                                             ` Jens Axboe
  2009-03-31 11:56                                                               ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-31 10:38 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

On Tue, Mar 31 2009, Fernando Luis Vázquez Cao wrote:
> Jens Axboe wrote:
>> On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
>>> Jens Axboe wrote:
>>>> On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
>>>>> Add a sysfs knob to disable storage device writeback cache flushes.
>>>>>
>>>>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>>>> ---
>>>>>
>>>>> diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
>>>>> --- linux-2.6.29-orig/block/blk-barrier.c	2009-03-24 08:12:14.000000000 +0900
>>>>> +++ linux-2.6.29/block/blk-barrier.c	2009-03-30 17:08:28.000000000 +0900
>>>>> @@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
>>>>>  	if (!q)
>>>>>  		return -ENXIO;
>>>>>
>>>>> +	if (blk_queue_nowbcflush(q))
>>>>> +		return -EOPNOTSUPP;
>>>>> +
>>>>>  	bio = bio_alloc(GFP_KERNEL, 0);
>>>>>  	if (!bio)
>>>>>  		return -ENOMEM;
>>>>> diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
>>>>> --- linux-2.6.29-orig/block/blk-core.c	2009-03-24 08:12:14.000000000 +0900
>>>>> +++ linux-2.6.29/block/blk-core.c	2009-03-30 17:08:28.000000000 +0900
>>>>> @@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
>>>>>  			goto end_io;
>>>>>  		}
>>>>>  		if (bio_barrier(bio) && bio_has_data(bio) &&
>>>>> -		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
>>>>> +		    (blk_queue_nowbcflush(q) ||
>>>>> +		     q->next_ordered == QUEUE_ORDERED_NONE)) {
>>>>>  			err = -EOPNOTSUPP;
>>>>>  			goto end_io;
>>>>>  		}
>>>> This (and the above hunk) should be changed. -EOPNOTSUPP means the
>>>> target does not support barriers, that is a different thing to flushes
>>>> not being needed. A file system issuing a barrier and getting
>>>> -EOPNOTSUPP back will disable barriers, since it now thinks that
>>>> ordering cannot be guaranteed.
>>> The reason I decided to use -EOPNOTSUPP was that I wanted to keep
>>> barriers and device flushes from entering the block layer when
>>> they are not needed. I feared that if we pass them down the block
>>> stack (knowing in advance they will not be actually submitted to
>>> disk) we may end up slowing things down unnecessarily.
>>
>> But that's just wrong, you need to make sure that the block layer / io
>> scheduler doesn't reorder as well. It's a lot more complex than just the
>> device end. So just returning -EOPNOTSUPP and pretending that you need
>> not use barriers at the fs end is just wrong.
>
> I should have mentioned that in this patch set I was trying to tackle the
> blkdev_issue_flush() case only. As you pointed out, with the code above
> requests may get silently reordered across barriers inside the block layer.
>
> The follow-up patch I am working on implements blkdev_issue_empty_barrier(),
> which should be used by filesystems that want to emit an empty barrier (as
> opposed to just triggering a device flush). Doing this we can optimize
> fsync() flushes (block_flush_device()) and filesystem-originated barriers
> (blkdev_issue_empty_barrier()) independently in the block layer.

Not sure it makes sense to abstract that out into an api, it's basically
just a bio_alloc(gfp, 0); with setting the bio fields and then
submitting. Otherwise you'd have to either pass a ton of parameters, the
caller will want to set end_io, bdev, etc anyway. And after that it's
just submit_bio().

> I agree with you that we should pass barriers down in
> __generic_make_request, but the optimization above for fsync()-originated
> blkdev_issue_flush()'s seems valid to me.

Of course, we need to do that. Anything else would be broken. The
blkdev_issue_flush() should be changed to return 0, with the -EOPNOTSUPP
result being cached in a flag.
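
One possible reading of that, as a user-space toy model rather than
block layer code: the first -EOPNOTSUPP from the device is remembered in
a per-queue flag, the caller sees 0, and later flushes are skipped. All
names below (toy_queue, toy_issue_flush, toy_no_flush_device) are
invented for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a request queue; not the kernel's struct request_queue. */
struct toy_queue {
    bool flush_unsupported;            /* cached "device said -EOPNOTSUPP" */
    int (*issue)(struct toy_queue *q); /* hypothetical low-level flush hook */
    unsigned int issued;               /* flushes that reached the device */
};

/*
 * Returns 0 when the flush completed, or when the device is known not to
 * support flushes (the caller cannot do anything about that), and a
 * negative errno for real failures.
 */
static int toy_issue_flush(struct toy_queue *q)
{
    assert(q != NULL && q->issue != NULL);

    if (q->flush_unsupported)
        return 0;                      /* already known: skip the round trip */

    q->issued++;
    int ret = q->issue(q);
    if (ret == -EOPNOTSUPP) {
        q->flush_unsupported = true;   /* remember it, hide it from the caller */
        return 0;
    }
    return ret;
}

/* Example device that never supports flushes. */
static int toy_no_flush_device(struct toy_queue *q)
{
    (void)q;
    return -EOPNOTSUPP;
}
```

With toy_no_flush_device plugged in, the device is asked exactly once
and every call returns 0.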

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31  2:47                                                                     ` Linus Torvalds
  2009-03-31  6:04                                                                       ` Jens Axboe
@ 2009-03-31 11:15                                                                       ` Ric Wheeler
  2009-03-31 14:55                                                                         ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-31 11:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

Linus Torvalds wrote:
> On Mon, 30 Mar 2009, Ric Wheeler wrote:
>   
>> One thing the caller could do is to disable the write cache on the device.
>>     
>
> First off, that's not the callers job. If the sysadmin enabled it, some 
> random filesystem shouldn't disable it.
>
> Secondly, this whole insane belief that "write cache" has anything to do 
> with "unable to flush" is just bogus. 
>   

First I have heard anyone (other than you above) claim that "unable to 
flush" is tied to the write cache on disks.

What I was responding to is your objection to exposing the proper error 
codes to the file system layer instead of hiding them in the block 
layer. True, the write cache example I used is pretty contrived, but it 
would be a valid strategy if your sacred sys admin had mounted with the 
"I do care about my data" mount option and left it up to the file system 
to make it happen.
>   
>> A second would be to stop using the transactions - skip the journal, 
>> just go back to ext2 mode or BSD like soft updates.
>>     
>
> f*ck me, what's so hard with understanding that EOPNOTSUPP doesn't mean 
> "no ordering". It means what it says - the op isn't supported. For all you 
> know, ALL WRITES MAY BE TOTALLY ORDERED, but perhaps there is no way to 
> make a _single_ write totally atomic (ie the "set barrier on a command 
> that actually does IO").
>   

Now you are just being silly. The drive and the write cache - without 
barriers or similar tagged operations - will almost certainly reorder 
all of the IO's internally.

No one designs code based on the "it might be ordered" basis.

The way the barriers work does absolutely give you full ordering.  All 
previous IO's are sent to the drive and flushed (barrier flush 1), the 
commit record is sent down followed by a second barrier flush. There is 
no way that the commit block will pass its dependent IO's.
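
The same two-flush pattern can be sketched from user space with a
database-style log file. This is an analogy only: toy_journal_commit()
and pwrite_all() are made-up names, and whether fdatasync() actually
reaches the platter depends on the filesystem and mount options.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer at offset `off`, retrying on short writes. */
static int pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/*
 * Hypothetical journal commit: the log blocks are flushed before the
 * commit record is written, and the commit record is flushed before
 * returning, so the commit record cannot pass the blocks it depends on.
 */
static int toy_journal_commit(int fd, const void *blocks, size_t blocks_len,
                              const void *commit, size_t commit_len)
{
    if (pwrite_all(fd, blocks, blocks_len, 0) < 0)
        return -1;
    if (fdatasync(fd) < 0)              /* flush 1: dependent blocks */
        return -1;
    if (pwrite_all(fd, commit, commit_len, (off_t)blocks_len) < 0)
        return -1;
    return fdatasync(fd);               /* flush 2: commit record */
}
```
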
> Besides, why the hell do you think the filesystem (again) should do 
> something that the admin didn't ask it to do.
>
> If the admin wants the thing to fall back to ext2, then he can ask to 
> disable the journal.
>
>   
>> Basically, it lets the file system know that its data integrity building
>> blocks are not really there and allows it (if it cares) to try and minimize
>> the chance of data loss.
>>     
>
> Your whole idiotic "as a filesystem designer I know better than everybody 
> else" model where the filesystem is in total control is total crap.
>
> The fact is, it's not the filesystems job to make that decision. If the 
> admin wants to have write caching enabled, the filesystem should get the 
> hell out of the way.
>   

This is not me being snotty - this is really very basic to how 
transactions work. You need ordering and file systems (or data bases) 
that use transactions must have these building blocks to do the job right.

Your argument seems to be, "Well, it will mostly be ordered anyway, as 
long as you don't lose power" which I simply don't agree is a good 
assumption.

The logical conclusion of that argument is that we really should not use 
transactions at all - basically remove the journal from ext3/4, xfs, 
btrfs, etc.  That is a point of view - drives are crap, journalling does 
not help anyway, why bother. 

> What about laptop mode? Do you expect your filesystem to always decide 
> that "ok, the user wanted to spin down disks, but I know better"?
>   

Laptop mode is pretty much a red herring here. Mount it without barriers 
enabled - your drive will still spin up occasionally, but as you argued 
above, that existing option allows you, the user/admin, to make that 
trade-off.

> What about people who have UPS's and don't worry about that part? They 
> want write caching on the disk, and simply don't want to sync? They still 
> worry about OS crashing, since they run random -git development kernels?
>   
If you run with a UPS or have a battery backed write cache, you should 
run without barriers since both of those mechanisms give you the 
required promise of ordering even in face of power outage.  Again, mount 
with barriers disabled (or rely on the storage target to ignore your 
cache flush commands, which higher end gear will do on a cache flush 
command).

Not hard to do, no additional code needed. We can even automate it as it 
is done in some of the linux based home storage boxes.

> In short, stop this IDIOTIC notion that you know better. YOU DO NOT KNOW 
> BETTER. The filesystem DOES NOT KNOW BETTER. It should damn well not do 
> those kinds of decisions that are simply not filesystem decisions to make!
>
> 			Linus
>   

Not surprisingly, I still disagree with you. Based, strangely enough, on 
looking at real data over many years, not just my personal experience 
with a small handful of drives.

If you don't want to run with the data integrity that we have painfully 
baked into the file & storage stack over many years, you can simply 
mount without barriers.

Why tear down & attack the infrastructure for those users who do care?

ric




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-31  1:26                                                           ` Tejun Heo
  2009-03-31  1:58                                                             ` Theodore Tso
@ 2009-03-31 11:18                                                             ` Jens Axboe
  2009-03-31 21:29                                                               ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-31 11:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Theodore Tso, Chris Mason, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Linus Torvalds, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	david

On Tue, Mar 31 2009, Tejun Heo wrote:
> Hello,
> 
> Theodore Tso wrote:
> > On Mon, Mar 30, 2009 at 10:15:51AM -0400, Chris Mason wrote:
> >> I'm not sure we want to stick Fernando with changing how barriers are
> >> done in individual filesystems, his patch is just changing the existing
> >> call points.
> > 
> > Well, his patch actually added some calls to block_issue_flush().  But
> > yes, it's probably better if he just changes the existing call points,
> > and we can have the relevant filesystem maintainers double check to
> > make sure that there aren't any new call points which are needed.
> 
> How about having something like blk_ensure_cache_flushed() which
> issues flush iff there has been a write since the last flush?
> It'll be easy to implement and will filter out duplicate flushes in
> most cases.

My original ide implementation of flushes actually did this. My memory
is a little hazy on why it was dropped, I'm guessing because it
basically never triggered anyway.
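
For reference, the idea amounts to a dirty flag per device. A toy
user-space model, assuming the intent is to skip the flush when nothing
has been written since the previous one (all names here are invented;
this is not the old ide code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a device with a volatile write cache. */
struct toy_dev {
    bool dirty;            /* a write was issued since the last flush */
    unsigned int flushes;  /* flush commands actually sent */
};

/* Any write makes the cache potentially dirty again. */
static void toy_write(struct toy_dev *d)
{
    assert(d != NULL);
    d->dirty = true;
}

/* Hypothetical analogue of the proposed blk_ensure_cache_flushed(). */
static void toy_ensure_cache_flushed(struct toy_dev *d)
{
    assert(d != NULL);
    if (!d->dirty)
        return;            /* cache already clean: duplicate flush filtered */
    d->flushes++;          /* stands in for sending the real flush command */
    d->dirty = false;
}
```

Two back-to-back calls after one write send a single flush; the second
call only costs a flag test.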

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 11:37                                 ` Theodore Tso
                                                     ` (2 preceding siblings ...)
  2009-03-26 14:03                                   ` Ingo Molnar
@ 2009-03-31 11:51                                   ` Neil Brown
  3 siblings, 0 replies; 664+ messages in thread
From: Neil Brown @ 2009-03-31 11:51 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
	Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
	David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thursday March 26, tytso@mit.edu wrote:
> Ingo,
.....
> 
> > Oh, and while at it - also a job control complaint. I tried to 
> > Ctrl-C the above script:
> > 
> > I had to hit Ctrl-C numerous times before Bash would honor it. This 
> > too is a very common thing on large SMP systems.
> 
> Well, the script you sent runs the compile in the background.  It did:
> 
> >   while :; do
> >     date
> >     make mrproper      2>/dev/null >/dev/null
> >     make defconfig     2>/dev/null >/dev/null
> >     make -j32 bzImage  2>/dev/null >/dev/null
> >   done &
>          ^^
> 
> So there would have been nothing to ^C; I assume you were running this
> with a variant that didn't have the ampersand, which would have run
> the whole shell pipeline in a detached background process?
> 
> In any case, the workaround for this is to ^Z the script, and then
> "kill %" it.
> 
> I'm pretty sure this is actually a bash problem.  When you send a
> Ctrl-C, it sends a SIGINT to all of the members of the tty's
> foreground process group.  Under some circumstances, bash sets the
> signal handler for SIGINT to be SIG_IGN.  I haven't looked at this
> super closely (it would require diving into the bash sources), but you
> can see it if you attach an strace to the bash shell driving a script
> such as
> 
> #!/bin/bash
> 
> while /bin/true; do
>       date
>       sleep 60
> done &
> 
> If you do a "ps axo pid,ppid,pgrp,args", you'll see that the bash and
> the sleep 60 have the same process group.  If you emulate hitting ^C
> by sending a SIGINT to pid of the shell, you'll see that it ignores
> it.  Sleep also seems to be ignoring the SIGINT when run in the
> background; but it does honor SIGINT in the foreground --- I didn't
> have time to dig into that.
> 
> In any case, bash appears to SIG_IGN the INT signal if there is a child
> process running, and only takes the ^C if bash itself is actually
> "running" the shell script.  For example, if you run the command
> "date;sleep 10;date;sleep 10;date", the ^C only interrupts the sleep
> command.  It doesn't stop the series of commands which bash is
> running.

This is something that is really hard to get right.

If the shell is running a program when SIGINT arrives, it needs to
wait until the program exits, and then try to decide if the program
died because of the signal, or actually caught the signal (from the
user's perspective), did something useful, and then chose to exit.

If the program's exit status shows that it died due to SIGINT, it is
easy to know what to do.  But lots of non-trivial programs, probably
including 'make', catch SIGINT, do some quick cleanup and then exit.
In that case the shell has a hard time deciding what to do.

I wrote a job-controlling shell many years ago and I think the
heuristic I came up with was that if the process exited with the
SIGINT status, or with a non-zero error status in less than 3 seconds
after the signal actually arrived, then react to the signal and abort
any script.  However if the process takes longer to exit or returns a
zero exit status, assume that it was interactive and handled the
interrupt to the user's satisfaction, and continue with any script.
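
Written out as code, the heuristic looks roughly like the sketch below.
It takes the 3 second window to apply only to the non-zero exit case,
which is one reading of the description; the type and function names
are invented.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/* How the foreground child ended, decoded from a wait() status. */
struct toy_exit {
    bool signaled;  /* killed by a signal */
    int sig;        /* terminating signal, valid if signaled */
    int code;       /* exit code, valid if !signaled */
};

/* Decode a status obtained from wait()/waitpid(). */
static struct toy_exit toy_decode(int wait_status)
{
    struct toy_exit e = { false, 0, 0 };
    if (WIFSIGNALED(wait_status)) {
        e.signaled = true;
        e.sig = WTERMSIG(wait_status);
    } else if (WIFEXITED(wait_status)) {
        e.code = WEXITSTATUS(wait_status);
    }
    return e;
}

/*
 * Decide whether the shell should abort the running script, given how the
 * child ended and how long after SIGINT arrived it took to exit.
 */
static bool toy_should_abort_script(struct toy_exit e, double secs_after_sigint)
{
    if (e.signaled && e.sig == SIGINT)
        return true;    /* the child died of the signal itself */
    if (!e.signaled && e.code != 0 && secs_after_sigint < 3.0)
        return true;    /* caught it, cleaned up quickly, reported failure */
    return false;       /* slow or successful exit: assume it was handled */
}
```

A program that catches SIGINT, cleans up and exits 1 within a second
aborts the script; an interactive program that handles the interrupt and
later exits 0 does not.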


I don't know what bash does, and it is possible that it could do a
better job.  But it is a problem for which there is no straightforward
solution (a bit like filesystem data safety it would seem :-)

NeilBrown

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 5/7] vfs: Add  wbcflush sysfs knob to disable storage device writeback cache flushes
  2009-03-31 10:38                                                             ` Jens Axboe
@ 2009-03-31 11:56                                                               ` Fernando Luis Vázquez Cao
  0 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-03-31 11:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, david, tj

Jens Axboe wrote:
> On Tue, Mar 31 2009, Fernando Luis Vázquez Cao wrote:
>> Jens Axboe wrote:
>>> On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
>>>> Jens Axboe wrote:
>>>>> On Mon, Mar 30 2009, Fernando Luis Vázquez Cao wrote:
>>>>>> Add a sysfs knob to disable storage device writeback cache flushes.
>>>>>>
>>>>>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>>>>> ---
>>>>>>
>>>>>> diff -urNp linux-2.6.29-orig/block/blk-barrier.c linux-2.6.29/block/blk-barrier.c
>>>>>> --- linux-2.6.29-orig/block/blk-barrier.c	2009-03-24 08:12:14.000000000 +0900
>>>>>> +++ linux-2.6.29/block/blk-barrier.c	2009-03-30 17:08:28.000000000 +0900
>>>>>> @@ -318,6 +318,9 @@ int blkdev_issue_flush(struct block_devi
>>>>>>  	if (!q)
>>>>>>  		return -ENXIO;
>>>>>>
>>>>>> +	if (blk_queue_nowbcflush(q))
>>>>>> +		return -EOPNOTSUPP;
>>>>>> +
>>>>>>  	bio = bio_alloc(GFP_KERNEL, 0);
>>>>>>  	if (!bio)
>>>>>>  		return -ENOMEM;
>>>>>> diff -urNp linux-2.6.29-orig/block/blk-core.c linux-2.6.29/block/blk-core.c
>>>>>> --- linux-2.6.29-orig/block/blk-core.c	2009-03-24 08:12:14.000000000 +0900
>>>>>> +++ linux-2.6.29/block/blk-core.c	2009-03-30 17:08:28.000000000 +0900
>>>>>> @@ -1452,7 +1452,8 @@ static inline void __generic_make_reques
>>>>>>  			goto end_io;
>>>>>>  		}
>>>>>>  		if (bio_barrier(bio) && bio_has_data(bio) &&
>>>>>> -		    (q->next_ordered == QUEUE_ORDERED_NONE)) {
>>>>>> +		    (blk_queue_nowbcflush(q) ||
>>>>>> +		     q->next_ordered == QUEUE_ORDERED_NONE)) {
>>>>>>  			err = -EOPNOTSUPP;
>>>>>>  			goto end_io;
>>>>>>  		}
>>>>> This (and the above hunk) should be changed. -EOPNOTSUPP means the
>>>>> target does not support barriers, that is a different thing to flushes
>>>>> not being needed. A file system issuing a barrier and getting
>>>>> -EOPNOTSUPP back will disable barriers, since it now thinks that
>>>>> ordering cannot be guaranteed.
>>>> The reason I decided to use -EOPNOTSUPP was that I wanted to keep
>>>> barriers and device flushes from entering the block layer when
>>>> they are not needed. I feared that if we pass them down the block
>>>> stack (knowing in advance they will not be actually submitted to
>>>> disk) we may end up slowing things down unnecessarily.
>>> But that's just wrong, you need to make sure that the block layer / io
>>> scheduler doesn't reorder as well. It's a lot more complex than just the
>>> device end. So just returning -EOPNOTSUPP and pretending that you need
>>> not use barriers at the fs end is just wrong.
>> I should have mentioned that in this patch set I was trying to tackle the
>> blkdev_issue_flush() case only. As you pointed out, with the code above
>> requests may get silently reordered across barriers inside the block layer.
>>
>> The follow-up patch I am working on implements blkdev_issue_empty_barrier(),
>> which should be used by filesystems that want to emit an empty barrier (as
>> opposed to just triggering a device flush). Doing this we can optimize
>> fsync() flushes (block_flush_device()) and filesystem-originated barriers
>> (blkdev_issue_empty_barrier()) independently in the block layer.
> 
> Not sure it makes sense to abstract that out into an api, it's basically
> just a bio_alloc(gfp, 0); with setting the bio fields and then
> submitting. Otherwise you'd have to either pass a ton of parameters, the
> caller will want to set end_io, bdev, etc anyway. And after that it's
> just submit_bio().

I will give it a try and see how it looks.

>> I agree with you that we should pass barriers down in
>> __generic_make_request, but the optimization above for fsync()-originated
>> blkdev_issue_flush()'s seems valid to me.
> 
> Of course, we need to do that. Anything else would be broken. The
> blkdev_issue_flush() should be changed to return 0, with the -EOPNOTSUPP
> result being cached in a flag.

I am currently cooking a new iteration of these patches that do just
that. I will be reposting in a new thread and keep you all CCed.

- Fernando

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 20:52                                                                 ` Mark Lord
  2009-03-30 20:57                                                                   ` Jeff Garzik
@ 2009-03-31 13:16                                                                   ` Chris Mason
  2009-03-31 13:23                                                                     ` Mark Lord
  2009-03-31 15:49                                                                   ` Eric Sandeen
  2 siblings, 1 reply; 664+ messages in thread
From: Chris Mason @ 2009-03-31 13:16 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jens Axboe, Linus Torvalds, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	david, tj

On Mon, 2009-03-30 at 16:52 -0400, Mark Lord wrote:
> Jens Axboe wrote:
> > On Mon, Mar 30 2009, Linus Torvalds wrote:
> >>
> >> On Mon, 30 Mar 2009, Jens Axboe wrote:
> >>> Sorry, I just don't see much point to doing it this way instead. So now
> >>> the fs will have to check a queue bit after it has issued the flush, how
> >>> is that any better than having the 'error' returned directly?
> >> No.
> >>
> >> Now the fs SHOULD NEVER CHECK AT ALL.
> >>
> >> Either it did the ordering, or the FS cannot do anything about it. 
> >>
> >> That's the point. EOPNOTSUPP is not a useful error message. You can't 
> >> _do_ anything about it.
> > 
> > My point is that some file systems may or may not have different paths
> > or optimizations depending on whether barriers are enabled and working
> > or not. Apparently that's just reiserfs and Chris says we can remove it,
> > so it is probably a moot point.
> ..
> 
> XFS appears to have something along those lines.
> I believe it tries to disable the drive write caches
> if it discovers that it cannot do cache flushes.
> 

If we get EOPNOTSUPP back from a submit_bh/submit_bio, the IO didn't
happen.  So, all the filesystems have code to try again without the
barrier flag, and then stop doing barriers from then on.

I'm not saying this is a good or bad API, just explaining for this one
example how it is being used today ;)
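
As a toy user-space model of that retry pattern (not code from any
filesystem; toy_fs, toy_submit_commit and toy_no_barrier_dev are made-up
names, and the submit hook is a hypothetical stand-in for
submit_bh/submit_bio):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts how many barrier requests the example device has rejected. */
static unsigned int toy_barrier_attempts;

struct toy_fs {
    bool use_barriers;             /* cleared after the first -EOPNOTSUPP */
    int (*submit)(bool barrier);   /* hypothetical stand-in for submit_bio() */
};

/* Submit a commit block, falling back to a plain write if needed. */
static int toy_submit_commit(struct toy_fs *fs)
{
    assert(fs != NULL && fs->submit != NULL);

    if (fs->use_barriers) {
        int ret = fs->submit(true);
        if (ret != -EOPNOTSUPP)
            return ret;
        /* The IO did not happen: stop using barriers and resubmit. */
        fs->use_barriers = false;
    }
    return fs->submit(false);
}

/* Example device that rejects barrier requests but accepts plain writes. */
static int toy_no_barrier_dev(bool barrier)
{
    if (barrier) {
        toy_barrier_attempts++;
        return -EOPNOTSUPP;
    }
    return 0;
}
```

The first commit tries the barrier once and resubmits; every later
commit goes straight to the plain write.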

> I'll check next time my MythTV box boots up.
> It has a RAID0 under XFS, and the md raid0 code doesn't
> appear to pass the cache flushes to libata for raid0,
> so XFS complains and tries to turn off the write caches.
> 
>
> And I have a script to damn well turn them back ON again
> after it does so.  Stupid thing tries to override user policy again.
> 

XFS does print a warning about not doing barriers any more, but the
write cache should still be on.  Especially with MD in front of it, the
storage stack is pretty complex, a mounted filesystem would have a hard
time knowing where to start to turn off write caches on each drive in
the stack.

You can test this pretty easily:

dd if=/dev/zero of=foo bs=4k count=10000 oflag=direct

If that runs faster than 1MB/s the write cache is still on.
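
A back-of-the-envelope model for where a threshold like 1MB/s comes
from. The figures are assumptions for illustration, not measurements: a
7200 RPM disk that, with the cache off, completes about one synchronous
4k write per platter revolution.

```c
#include <stdio.h>

/*
 * Rough ceiling on O_DIRECT throughput with the write cache off, assuming
 * one block reaches the platter per revolution.
 */
static double toy_uncached_bytes_per_sec(unsigned int rpm,
                                         unsigned int block_bytes)
{
    double revolutions_per_sec = rpm / 60.0;
    return revolutions_per_sec * block_bytes;
}

/* Print the estimate for a given drive speed and block size. */
static void toy_print_estimate(unsigned int rpm, unsigned int block_bytes)
{
    printf("%u RPM, %u byte blocks: ~%.0f bytes/s\n",
           rpm, block_bytes, toy_uncached_bytes_per_sec(rpm, block_bytes));
}
```

Under those assumptions bs=4k gives 120 * 4096 = 491520 bytes/s, about
0.5MB/s, so the 10000 writes above would take around 83 seconds with the
cache off.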

-chris



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 13:16                                                                   ` Chris Mason
@ 2009-03-31 13:23                                                                     ` Mark Lord
  2009-03-31 13:28                                                                       ` Chris Mason
  0 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-03-31 13:23 UTC (permalink / raw)
  To: Chris Mason
  Cc: Jens Axboe, Linus Torvalds, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	david, tj

Chris Mason wrote:
>
> You can test this pretty easily:
> 
> dd if=/dev/zero of=foo bs=4k count=10000 oflag=direct
> 
> If that runs faster than 1MB/s the write cache is still on.
..

Or simply:   hdparm -W /dev/sd?   ## (for SATA/PATA drives)

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 13:23                                                                     ` Mark Lord
@ 2009-03-31 13:28                                                                       ` Chris Mason
  0 siblings, 0 replies; 664+ messages in thread
From: Chris Mason @ 2009-03-31 13:28 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jens Axboe, Linus Torvalds, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	david, tj

On Tue, 2009-03-31 at 09:23 -0400, Mark Lord wrote:
> Chris Mason wrote:
> >
> > You can test this pretty easily:
> > 
> > dd if=/dev/zero of=foo bs=4k count=10000 oflag=direct
> > 
> > If that runs faster than 1MB/s the write cache is still on.
> ..
> 
> Or simply:   hdparm -W /dev/sd?   ## (for SATA/PATA drives)

I'm afraid I tend to hammer on the drive instead of asking it politely,
but I guess hdparm is trustworthy these days ;)

-chris



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 22:07           ` Arjan van de Ven
  2009-03-30 10:18             ` Pavel Machek
@ 2009-03-31 13:33             ` Rafael J. Wysocki
  2009-03-31 15:30             ` Hans-Peter Jansen
  2009-03-31 19:37             ` Jeff Garzik
  3 siblings, 0 replies; 664+ messages in thread
From: Rafael J. Wysocki @ 2009-03-31 13:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Hans-Peter Jansen, Linus Torvalds, Mike Galbraith,
	Geert Uytterhoeven, linux-kernel

On Tuesday 31 March 2009, Arjan van de Ven wrote:
> Hans-Peter Jansen wrote:
> > Am Freitag, 27. März 2009 schrieb Linus Torvalds:
> > 
> > I always wonder, why Arjan does not intervene for his kerneloops.org 
> > project, since your approach opens a window of uncertainty during the merge 
> > window when simply using git as an efficient fetch tool.
> 
> I would *love* it if Linus would, as first commit mark his tree as "-git0"
> (as per snapshots) or "-rc0". So that I can split the "final" versus
> "merge window" oopses.

FWIW, that would also be useful for tracking regressions.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 11:15                                                                       ` Ric Wheeler
@ 2009-03-31 14:55                                                                         ` Linus Torvalds
  2009-03-31 15:22                                                                           ` Chris Mason
                                                                                             ` (2 more replies)
  0 siblings, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-31 14:55 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj



On Tue, 31 Mar 2009, Ric Wheeler wrote:
> 
> Now you are just being silly. The drive and the write cache - without barriers
> or similar tagged operations - will almost certainly reorder all of the IO's
> internally.

You do realize that the "drive" may not be a drive at all?

But apparently you don't. You really seem to see just your own case, and 
have blinders on for everything else.

That "drive" may be some virtualized device. It may be some super-fancy 
memory mapped and largely undocumented random flash thing. It might be a 
network block device, it may be somebody's IO trace dummy layer, it may be 
anything at all.

Your filesystem doesn't know. It damn well should not even _try_ to know, 
because it isn't the low-level driver.

The low-level driver - which you don't have a friggin clue about - may say 
that it doesn't support barrier IO for any random reason that has 
absolutely _nothing_ to do with any write caches or anything else. Maybe 
the device has the same ordering semantics as an Intel CPU has: writes are 
always seen in order on the disk, and reads are always speculated but will 
snoop in write buffers, and there is no way to not do that.

See? EOPNOTSUPP means just that - it means that the driver doesn't support 
the notion of ordered IO. But that does not necessarily mean that the 
writes aren't always in order. It may well just mean that the drive is a 
thin shimmy layer over something else (for example, just a user level 
pipe), and the driver has NO IDEA what the end result is, and the protocol 
is simplistic and is just 'read' and 'write' and absolutely nothing else.

But you seem to NOT UNDERSTAND THIS.

I'm not interested in your inane drivel. Let's just say that your lack of 
understanding just means that your input is irrelevant, and leave it at 
that. Ok? Until you can see the bigger picture, just don't bother.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28  1:19                                                               ` Jeff Garzik
                                                                                   ` (2 preceding siblings ...)
  2009-03-29  0:33                                                                 ` david
@ 2009-03-31 15:01                                                                 ` Thierry Vignaud
  3 siblings, 0 replies; 664+ messages in thread
From: Thierry Vignaud @ 2009-03-31 15:01 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Jeff Garzik <jeff@garzik.org> writes:

> > Of course, your browsing history database is an excellent example of
> > something you should _not_ care about that much, and where
> > performance is a lot more important than "ooh, if the machine goes
> > down suddenly, I need to be 100% up-to-date". Using fsync on that
> > thing was just stupid, even 
> 
> If you are doing a ton of web-based work with a bunch of tabs or
> windows open, you really like the post-crash restoration methods that
> Firefox now employs.  Some users actually do want to
> checkpoint/restore their web work, regardless of whether it was the
> browser, the window system or the OS that crashed.

This is all about tradeoffs.
I guess everybody can afford losing the last 30 seconds of history (or
5 min ...).
That's not that much lost work...


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 14:55                                                                         ` Linus Torvalds
@ 2009-03-31 15:22                                                                           ` Chris Mason
  2009-03-31 15:41                                                                           ` Ric Wheeler
  2009-03-31 15:54                                                                           ` Linus Torvalds
  2 siblings, 0 replies; 664+ messages in thread
From: Chris Mason @ 2009-03-31 15:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Jens Axboe, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	david, tj

On Tue, 2009-03-31 at 07:55 -0700, Linus Torvalds wrote:
> 
> On Tue, 31 Mar 2009, Ric Wheeler wrote:
> > 
> > Now you are just being silly. The drive and the write cache - without barriers
> > or similar tagged operations - will almost certainly reorder all of the IO's
> > internally.
> 
> You do realize that the "drive" may not be a drive at all?
> 
> But apparently you don't. You really seem to see just your own case, and 
> have blinders on for everything else.
> 
> That "drive" may be some virtualized device. It may be some super-fancy 
> memory mapped and largely undocumented random flash thing. It might be a 
> network block device, it may be somebody's IO trace dummy layer, it may be 
> anything at all.
> 

The part that we seem to be skipping over in talking about EOPNOTSUPP is
not what do we do when a barrier isn't supported (print a warning and
move on), it's what do we do when a barrier works.  I very much agree
that EOPNOTSUPP tells us almost nothing.

The idea behind the original implementation was that when barriers did
work, we could make some assumptions about how IO would be ordered
around the barrier, and those assumptions would let us optimize things
for the lying cheating cache enabled storage that we all know and love.

It turns out 6 years later that very few people are interested in those
optimizations, and we're probably better off skipping them in favor of
reducing the complexity of the code involved.

Jens has a little burial site all prepped for pdflush in his yard,
dumping EOPNOTSUPP in there too wouldn't be a bad thing.

-chris




* Re: Linux 2.6.29
  2009-03-30 22:07           ` Arjan van de Ven
  2009-03-30 10:18             ` Pavel Machek
  2009-03-31 13:33             ` Rafael J. Wysocki
@ 2009-03-31 15:30             ` Hans-Peter Jansen
  2009-03-31 19:37             ` Jeff Garzik
  3 siblings, 0 replies; 664+ messages in thread
From: Hans-Peter Jansen @ 2009-03-31 15:30 UTC (permalink / raw)
  To: Arjan van de Ven, gitster
  Cc: Linus Torvalds, Mike Galbraith, Geert Uytterhoeven, linux-kernel

On Tuesday, 31 March 2009, Arjan van de Ven wrote:
> Hans-Peter Jansen wrote:
> >
> > I always wonder, why Arjan does not intervene for his kerneloops.org
> > project, since your approach opens a window of uncertainty during the
> > merge window when simply using git as an efficient fetch tool.
>
> I would *love* it if Linus would, as first commit mark his tree as
> "-git0" (as per snapshots) or "-rc0". So that I can split the "final"
> versus "merge window" oopses.

..which is an important difference. I still vote for -pre for "preparation 
state" as -git0 does imply some sort of versioning, which *is* meaningless 
in this state.

Linus, this would be a small step for you, but it would make a big difference 
for those of us who sorely miss it.

Junio: is it possible to automate this in git somehow, i.e. to make sure that 
the first commit after a release really happens for a "new" version (e.g. a 
version patch to Makefile)?

Pete


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 14:55                                                                         ` Linus Torvalds
  2009-03-31 15:22                                                                           ` Chris Mason
@ 2009-03-31 15:41                                                                           ` Ric Wheeler
  2009-03-31 16:15                                                                             ` Linus Torvalds
  2009-03-31 19:25                                                                             ` Mark Lord
  2009-03-31 15:54                                                                           ` Linus Torvalds
  2 siblings, 2 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-31 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

Linus Torvalds wrote:
> 
> On Tue, 31 Mar 2009, Ric Wheeler wrote:
>> Now you are just being silly. The drive and the write cache - without barriers
>> or similar tagged operations - will almost certainly reorder all of the IO's
>> internally.
> 
> You do realize that the "drive" may not be a drive at all?
> 
> But apparently you don't. You really seem to see just your own case, and 
> have blinders on for everything else.
> 
> That "drive" may be some virtualized device. It may be some super-fancy 
> memory mapped and largely undocumented random flash thing. It might be a 
> network block device, it may be somebody's IO trace dummy layer, it may be 
> anything at all.

Of course I realize that.

Most of the SSD devices, including ones that don't speak normal S-ATA/SCSI/etc, 
have a write cache and will combine and re-order IO's.

Some of them have non-volatile write caches and those don't need barriers 
(flush, fua, whatever) because of batteries, capacitors or other magic hardware 
people came up with.

For the ones that do have a volatile write cache and can reorder IO's, 
transactions will still need the ordering primitives to survive a power failure 
reliably.

If you don't need or want to pay the price of ordering, you can today easily 
disable this by mounting without barriers.

As Mark pointed out, most S-ATA/SAS drives will flush the write cache when they 
see a bus reset so even without barriers, the cache will be preserved (or 
flushed) after a reboot or panic.  Power outages are the problem 
barriers/flushes are meant to help with.

> 
> Your filesystem doesn't know. It damn well not even _try_ to know, because 
> it isn't the low-level driver.
> 
> The low-level driver - which you don't have a friggin clue about - may say 
> that it doesn't support barrier IO for any random reason that has 
> absolutely _nothing_ to do with any write caches or anything else. Maybe 
> the device has the same ordering semantics as an Intel CPU has: writes are 
> always seen in order on the disk, and reads are always speculated but will 
> snoop in write buffers, and there is no way to not do that.
> 
> See? EOPNOTSUPP means just that - it means that the driver doesn't support 
> the notion of ordered IO. But that does not necessarily mean that the 
> writes aren't always in order. It may well just mean that the drive is a 
> thin shimmy layer over something else (for example, just a user level 
> pipe), and the driver has NO IDEA what the end result is, and the protocol 
> is simplistic and is just 'read' and 'write' and absolutely nothing else.
> 
> But you seem to NOT UNDERSTAND THIS.
> 
> I'm not interested in your inane drivel. Let's just say that your lack of 
> understanding just means that your input is irrelevant, and leave it at 
> that. Ok? Until you can see the bigger picture, just don't bother.
> 
> 			Linus


If the low level device returns EOPNOTSUPP on a barrier op, that is fine. 
Running a transactional file system on that storage might or might not be a good 
idea, but at least we can log that and move on.

I agree with Chris that what happens when the device does not support the 
primitives is not the core issue.

The question is really what we do when you have a storage device in your box 
with a volatile write cache that does support flush or fua or similar. Using 
barriers & ordered transactions for these types of devices will give you a more 
reliable file system - less fsck time needed and better data integrity support 
for the (few?) applications that use fsync properly.


Ric



* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-30 20:52                                                                 ` Mark Lord
  2009-03-30 20:57                                                                   ` Jeff Garzik
  2009-03-31 13:16                                                                   ` Chris Mason
@ 2009-03-31 15:49                                                                   ` Eric Sandeen
  2009-03-31 16:37                                                                     ` Mark Lord
  2 siblings, 1 reply; 664+ messages in thread
From: Eric Sandeen @ 2009-03-31 15:49 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jens Axboe, Linus Torvalds, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

Mark Lord wrote:
> Jens Axboe wrote:
>> On Mon, Mar 30 2009, Linus Torvalds wrote:
>>> On Mon, 30 Mar 2009, Jens Axboe wrote:
>>>> Sorry, I just don't see much point to doing it this way instead. So now
>>>> the fs will have to check a queue bit after it has issued the flush, how
>>>> is that any better than having the 'error' returned directly?
>>> No.
>>>
>>> Now the fs SHOULD NEVER CHECK AT ALL.
>>>
>>> Either it did the ordering, or the FS cannot do anything about it. 
>>>
>>> That's the point. EOPNOTSUPP is not a useful error message. You can't 
>>> _do_ anything about it.
>> My point is that some file systems may or may not have different paths
>> or optimizations depending on whether barriers are enabled and working
>> or not. Apparently that's just reiserfs and Chris says we can remove it,
>> so it is probably a moot point.
> ..
> 
> XFS appears to have something along those lines.
> I believe it tries to disable the drive write caches
> if it discovers that it cannot do cache flushes.

No, it just stops issuing barriers if the initial mount-time test finds
that they're not supported.  ext3/4/reiserfs do similar.
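
The mount-time behaviour described here can be sketched in userspace C. This
is only an illustration under stated assumptions, not the XFS or ext3 code:
`struct toy_fs`, `toy_fs_probe_barriers()` and the two test callbacks are
hypothetical names invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-mount state; the real filesystems keep this elsewhere. */
struct toy_fs {
	bool use_barriers;	/* true while the fs still issues barriers */
};

/*
 * Probe once at mount time. issue_test_barrier is a hypothetical callback
 * that returns 0 on success or a negative errno on failure.
 */
int toy_fs_probe_barriers(struct toy_fs *fs, int (*issue_test_barrier)(void))
{
	int err;

	assert(fs != NULL && issue_test_barrier != NULL);

	if (!fs->use_barriers)
		return 0;	/* barriers already disabled, nothing to test */

	err = issue_test_barrier();
	if (err == -EOPNOTSUPP) {
		/* Complain and stop issuing barriers; write caches are untouched. */
		fprintf(stderr, "toy_fs: barriers not supported, disabling\n");
		fs->use_barriers = false;
		return 0;	/* the mount itself still succeeds */
	}
	return err;		/* 0, or a real I/O error */
}

/* Stand-ins for a device that supports barriers and one that does not. */
int toy_barrier_ok(void) { return 0; }
int toy_barrier_unsupported(void) { return -EOPNOTSUPP; }
```

Note that the only state that changes is the filesystem's own flag; nothing
in this sketch reaches down to the drive's cache settings.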

> I'll check next time my MythTV box boots up.
> It has a RAID0 under XFS, and the md raid0 code doesn't
> appear to pass the cache flushes to libata for raid0,
> so XFS complains and tries to turn off the write caches.

It doesn't touch write caches; it just complains and stops issuing barriers.

> And I have a script to damn well turn them back ON again
> after it does so.  Stupid thing tries to override user policy again.

It does not do this.

-Eric


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 14:55                                                                         ` Linus Torvalds
  2009-03-31 15:22                                                                           ` Chris Mason
  2009-03-31 15:41                                                                           ` Ric Wheeler
@ 2009-03-31 15:54                                                                           ` Linus Torvalds
  2009-03-31 16:29                                                                             ` Alan Cox
  2 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-31 15:54 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj



On Tue, 31 Mar 2009, Linus Torvalds wrote:
> 
> Ok? Until you can see the bigger picture, just don't bother.

And this is really what it boils down to. Abstraction. Bigger picture. And 
the fact that the filesystem should DAMN WELL NOT THINK IT KNOWS WHAT IS 
GOING ON!

This is also fundamentally why returning that particular error is 
pointless. If the driver returns EOPNOTSUPP, there is simply never _any_ 
possible reason for upper layers to ever be informed about it - because 
there is not _any_ possible situation where they can do anything about it. 

Even _thinking_ that they can do something about it is fundamentally 
flawed. It misses the entire point of having layering and abstraction and 
having a "block layer" there to do these kinds of things.

If you want to write your filesystem so that it interacts with the 
low-level device directly, go and write an MTD filesystem instead. Don't 
even _claim_ to care about generic filesystems like 'ext3' or something 
like that. 

But if you try to be a "real" filesystem (ie general-purpose, meant to 
work on any random block device), don't come and whine about it when the 
block device then doesn't really do anything but read or write, or when 
the driver literally doesn't even _know_ how to serialize something 
because it doesn't even make sense in its world-view.

Don't mix up block layer and low-level driver issues with filesystem 
issues. The filesystem should say "block layer: flush the pending writes". 
And the block layer should try its best, but if the low-level driver says 
"that operation doesn't make sense for me", the block layer should just 
say "ok, whatever".

And the filesystem shouldn't know, and it most definitely must not act 
any differently. Because that's behind the abstraction, and there's no 
sane way to bring it _out_ of the abstraction that isn't fundamentally 
flawed (like thinking that it's always a SATA-II drive).
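
The layering argued for above can be sketched as a small userspace C model.
It is a sketch under stated assumptions, not kernel code: `struct toy_blkq`,
`toy_blk_flush()` and the two driver stubs are hypothetical names. The
filesystem calls one function, and "operation not supported" never crosses
back over the abstraction.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver hook: returns 0, -EOPNOTSUPP, or another -errno. */
typedef int (*toy_flush_fn)(void *driver_data);

struct toy_blkq {
	toy_flush_fn flush;	/* NULL if the driver has no notion of a flush */
	void *driver_data;
	bool noflush;		/* remembered below the fs, e.g. for admin tools */
};

/* What the filesystem calls: "block layer: flush the pending writes". */
int toy_blk_flush(struct toy_blkq *q)
{
	int err;

	assert(q != NULL);

	err = q->flush ? q->flush(q->driver_data) : -EOPNOTSUPP;
	if (err == -EOPNOTSUPP) {
		/* "ok, whatever": note it here and report success upwards */
		q->noflush = true;
		return 0;
	}
	return err;	/* real failures such as -EIO are still reported */
}

/* Stand-in drivers for the example. */
int toy_flush_unsupported(void *d) { (void)d; return -EOPNOTSUPP; }
int toy_flush_io_error(void *d) { (void)d; return -EIO; }
```

A genuine I/O error still propagates in this model; only the "doesn't make
sense for me" answer is absorbed by the block layer.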

		Linus


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 15:41                                                                           ` Ric Wheeler
@ 2009-03-31 16:15                                                                             ` Linus Torvalds
  2009-03-31 16:43                                                                               ` Jens Axboe
  2009-03-31 17:14                                                                               ` Ric Wheeler
  2009-03-31 19:25                                                                             ` Mark Lord
  1 sibling, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-31 16:15 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj



On Tue, 31 Mar 2009, Ric Wheeler wrote:
> 
> The question is really what we do when you have a storage device in your box
> with a volatile write cache that does support flush or fua or similar.

Ok. Then you are talking about a different case - not EOPNOTSUPP.

[ Although it may be related in that maybe the admin can _force_ a 
  EOPNOTSUPP thing for when he wants to disable any "write barrier implies 
  flush" thing.

  IOW, we may end up with an _implementation_ detail where we overload a 
  potential QUEUE_FLUSH_EOPNOTSUPP flag with two meanings - either "the 
  driver told me a barrier isn't supported" or "the admin set that same 
  flag by hand to disable barrier-related flush commands".

  But that's just an implementation detail, of course. We could use two 
  different flags, we could do the flags at different levels, whatever. ]
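
The "two different flags" variant mentioned in the bracketed note could look
roughly like the following userspace C sketch. All names here
(`struct toy_flush_state`, `toy_barrier_sends_flush()`,
`toy_admin_enable_flush()`) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue state with the two meanings kept apart. */
struct toy_flush_state {
	bool driver_unsupported;	/* the driver said a barrier isn't supported */
	bool admin_disabled;		/* the admin turned barrier flushes off by hand */
};

/* Should a write barrier be turned into a cache-flush command? */
bool toy_barrier_sends_flush(const struct toy_flush_state *s)
{
	assert(s != NULL);
	return !s->driver_unsupported && !s->admin_disabled;
}

/* The admin re-enables flushes; what the driver reported is left alone. */
void toy_admin_enable_flush(struct toy_flush_state *s)
{
	assert(s != NULL);
	s->admin_disabled = false;
}
```

The practical difference from a single overloaded flag shows up when the
admin changes his mind: with one bit, clearing it would also forget what the
driver reported.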

> Using barriers & ordered transactions for these types of devices will 
> give you a more reliable file system - less fsck time needed and better 
> data integrity support for the (few?) applications that use fsync 
> properly.

Sure. And it still shouldn't be the filesystem that _requires_ use of it.

The user (or low-level driver) may simply know better. The user may 
know that he trusts the disk more than anything else, and prefers to 
not actually emit the "FLUSH" command. Again, that's not something that 
the filesystem should know about, or care about. If the user trusts the 
disk subsystem and wants the performance, it's the user's choice.

Even the _driver_ may know better.

Knowing the kinds of firmware bugs those drives have, it could even be a 
driver that simply black-lists certain disks as having known-broken FLUSH 
commands. We have _CPU's_ that corrupt memory on cache writeback 
("wbinvd"), and those things are a lot more tested than most driver 
firmware is.

Do you realize just how buggy some of those flash drives are? Some of them 
will literally (a) report the wrong size and (b) lock up if you try to 
read from the last sector. Oops. Do you really expect such crap to 
even bother to honor some flush command? Good luck with that. They're 
designed as a floppy replacement.

Now, you can tell me that I shouldn't put a reliable filesystem on an 
el-cheapo flash drive and expect it to work, but I'm sorry, you're wrong. 
People _are_ supposed to be able to move their data around, and the 
filesystem shouldn't make judgement calls. If you want judgement calls, 
call your mom. Not your filesystem.

For another example, the driver might be a driver for a high-end 
battery-backup SCSI RAID controller. It knows that the controller _will_ 
write things out in the right order even in the case of a crash, but it 
may also know that the controller _also_ has a way to force a flush to 
actual hardware.

When do you want to force a flush? For hotplug events, for example. Maybe 
the disks won't be _connected_ any more afterwards - then the battery 
backup on the controller won't be helping, will it? So there may well be a 
flush event thing, but it's really up to the admin to decide whether it 
should be connected to a write barrier thing, or be a separate admin 
activity.

Maybe the admin is extra careful and anal, and decides that he wants to 
flush to disk platters _despite_ the battery backup. Maybe he doesn't 
trust the card. Maybe he does.  Whatever. The point is that the admin 
might want to set a driver flag that does the flush or not, and it's 
totally not a filesystem issue.

See? The filesystem has absolutely _no_place_ deciding these kinds of 
things. The only thing it can ask for is "please serialize", but what 
_level_ of serialization is simply not a filesystem decision to make.

And that very much includes the level of serialization that says "no 
serialization what-so-ever, and please go absolutely crazy with your 
cache". Not your choice.

So no, you can't have a pony.

			Linus


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 15:54                                                                           ` Linus Torvalds
@ 2009-03-31 16:29                                                                             ` Alan Cox
  0 siblings, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-31 16:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Jens Axboe, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

> And the filesystem shouldn't know, and it most definitely must not act 
> any differently. Because that's behind the abstraction, and there's no 
> sane way to bring it _out_ of the abstraction that isn't fundamentally 
> flawed (like thinking that it's always a SATA-II drive).

How the file system responds has to depend upon what the user's intent
is with regard to still having their data.

In a lot of cases "flush if you can" makes good sense. In higher
integrity cases you want a way to tell the device "flush if you can, do
whatever else is needed to fake a flush if not" and in some cases you
genuinely want to propagate errors back at mount time to say "sorry can't
do this".

Agreed entirely that this shouldn't be expressed down the stack in terms
of things like 'tags' or 'write with fua', but unless the different
versions of it can be expressed, or refused, you can't build a good enough
abstraction. "Throw and pray the block layer can fake it" simply isn't a
valid model for serious enterprise computing or, if people understood
the worst cases, for a lot of non-enterprise computing.
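
One way to picture the three intents listed above ("flush if you can", "fake
a flush if not", "refuse at mount time") is as a policy handed down to the
block layer. The userspace C sketch below is only an assumption about how
such an interface might look; the enums and `toy_negotiate_flush()` are
hypothetical names, and how an emulated flush would actually be carried out
is left open.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* What the user asked for, expressed without 'tags' or 'fua'. */
enum toy_flush_policy {
	TOY_FLUSH_BEST_EFFORT,	/* "flush if you can" */
	TOY_FLUSH_EMULATE,	/* "fake a flush if the device cannot" */
	TOY_FLUSH_REQUIRED,	/* "sorry can't do this" at mount time */
};

/* What the lower layer ends up doing. */
enum toy_flush_mode {
	TOY_MODE_NATIVE,	/* the device flushes for real */
	TOY_MODE_NONE,		/* no flush, no emulation */
	TOY_MODE_EMULATED,	/* some fallback strategy stands in */
};

/* Returns 0 and fills *mode, or -EOPNOTSUPP if the request is refused. */
int toy_negotiate_flush(enum toy_flush_policy policy, bool device_can_flush,
			enum toy_flush_mode *mode)
{
	assert(mode != NULL);

	if (device_can_flush) {
		*mode = TOY_MODE_NATIVE;
		return 0;
	}

	switch (policy) {
	case TOY_FLUSH_BEST_EFFORT:
		*mode = TOY_MODE_NONE;
		return 0;
	case TOY_FLUSH_EMULATE:
		*mode = TOY_MODE_EMULATED;
		return 0;
	case TOY_FLUSH_REQUIRED:
		return -EOPNOTSUPP;	/* propagated back to the mount */
	}
	return -EINVAL;			/* unknown policy value */
}
```

In this model the error only travels upwards when the caller explicitly asked
for a guarantee the device cannot give.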

The second problem is who has sufficient information to efficiently
handle decisions around ordering/barriers/flushes/single outstanding
command and other strategies. I am skeptical that, where the underlying
block subsystem provides suboptimal ordering/barrier facilities, having
it fall back to alternatives without letting the fs also change
strategies will be efficient.

Alan


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 15:49                                                                   ` Eric Sandeen
@ 2009-03-31 16:37                                                                     ` Mark Lord
  0 siblings, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-31 16:37 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jens Axboe, Linus Torvalds, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

Eric Sandeen wrote:
> Mark Lord wrote:
..
>> XFS appears to have something along those lines.
>> I believe it tries to disable the drive write caches
>> if it discovers that it cannot do cache flushes.
> 
> No, it just stops issuing barriers if the initial mount-time test finds
> that they're not supported.  ext3/4/reiserfs do similar.
..

Okay.  My apologies to the XFS folks!

I'll have to dig deeper to find out who/what is disabling
the drive write caches, then.

Thanks


* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 16:15                                                                             ` Linus Torvalds
@ 2009-03-31 16:43                                                                               ` Jens Axboe
  2009-03-31 16:57                                                                                 ` Linus Torvalds
  2009-03-31 17:03                                                                                 ` Jens Axboe
  2009-03-31 17:14                                                                               ` Ric Wheeler
  1 sibling, 2 replies; 664+ messages in thread
From: Jens Axboe @ 2009-03-31 16:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

On Tue, Mar 31 2009, Linus Torvalds wrote:
> 
> 
> On Tue, 31 Mar 2009, Ric Wheeler wrote:
> > 
> > The question is really what we do when you have a storage device in your box
> > with a volatile write cache that does support flush or fua or similar.
> 
> Ok. Then you are talking about a different case - not EOPNOTSUPP.

So here's a test patch that attempts to just ignore such a failure to
flush the caches. It will still flag the bio as BIO_EOPNOTSUPP, but
that's merely maintaining the information in case the caller does want
to see if that barrier failed or not. It may not actually be useful, in
which case we can just kill that flag.

But it'll return 0 for a write, getting rid of hard retry logic in the
file systems. It'll also ensure that blkdev_issue_flush() does not see
the -EOPNOTSUPP and pass it back.

The first time we see such a failed barrier, we'll log a warning in
dmesg about the block device. Subsequent failed barriers with
-EOPNOTSUPP will not warn.
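
The warn-once behaviour can be illustrated in userspace with a C11
`atomic_flag`, which plays the role that `test_and_set_bit()` plays in the
patch. This is a sketch under stated assumptions, not the patch itself:
`struct toy_warn_queue` and `toy_queue_set_noflush()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a request queue carrying a "noflush" bit. */
struct toy_warn_queue {
	atomic_flag noflush;	/* set on the first failed barrier */
	const char *name;	/* device name used in the log message */
};

/* Returns true only for the single call that actually logged the warning. */
bool toy_queue_set_noflush(struct toy_warn_queue *q)
{
	assert(q != NULL);

	/* test-and-set returns the previous value: true means already warned */
	if (atomic_flag_test_and_set(&q->noflush))
		return false;

	fprintf(stderr,
		"Device %s does not appear to honor cache flushes.\n", q->name);
	return true;
}
```

Because the test and the set happen as one atomic step, two barriers failing
at the same time still produce a single log line.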

Now, there's a follow up to this. If the device doesn't support barriers
and the block layer fails them early, we should still do the ordering
inside the block layer. Then we will at least not reorder there, even if
the device may or may not order. I'll test this patch and provide a
follow up patch that does that as well, before asking for any of this to
be included. So that's a note to not apply this patch, it hasn't been
tested!

commit 78ab31910c8c7b8853c1fd4d78c5f4ce2aebb516
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Tue Mar 31 18:42:42 2009 +0200

    barrier: Don't return -EOPNOTSUPP to the caller if the device does not support barriers
    
    The caller cannot really do much about the situation anyway. Instead log
    a warning if this is the first such failed barrier we see, so that the
    admin can look into whether this poses a data integrity problem or not.
    
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f7dae57..8660146 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -338,9 +338,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
 		*error_sector = bio->bi_sector;
 
 	ret = 0;
-	if (bio_flagged(bio, BIO_EOPNOTSUPP))
-		ret = -EOPNOTSUPP;
-	else if (!bio_flagged(bio, BIO_UPTODATE))
+	if (!bio_flagged(bio, BIO_UPTODATE))
 		ret = -EIO;
 
 	bio_put(bio);
@@ -408,9 +406,7 @@ int blkdev_issue_discard(struct block_device *bdev,
 		submit_bio(DISCARD_BARRIER, bio);
 
 		/* Check if it failed immediately */
-		if (bio_flagged(bio, BIO_EOPNOTSUPP))
-			ret = -EOPNOTSUPP;
-		else if (!bio_flagged(bio, BIO_UPTODATE))
+		if (!bio_flagged(bio, BIO_UPTODATE))
 			ret = -EIO;
 		bio_put(bio);
 	}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 59fd05d..0a81466 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -463,6 +463,19 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
 }
 EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 
+void blk_queue_set_noflush(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+	char b[BDEVNAME_SIZE];
+
+	if (test_and_set_bit(QUEUE_FLAG_NOFLUSH, &q->queue_flags))
+		return;
+
+	printk(KERN_ERR "Device %s does not appear to honor cache flushes. "
+			"This may mean that file system ordering guarantees "
+			"are not met.\n", bdevname(bdev, b));
+}
+
 static int __init blk_settings_init(void)
 {
 	blk_max_low_pfn = max_low_pfn - 1;
diff --git a/block/ioctl.c b/block/ioctl.c
index 0f22e62..769f7be 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -166,9 +166,7 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 
 		wait_for_completion(&wait);
 
-		if (bio_flagged(bio, BIO_EOPNOTSUPP))
-			ret = -EOPNOTSUPP;
-		else if (!bio_flagged(bio, BIO_UPTODATE))
+		if (!bio_flagged(bio, BIO_UPTODATE))
 			ret = -EIO;
 		bio_put(bio);
 	}
diff --git a/fs/bio.c b/fs/bio.c
index a040cde..79e3cec 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1380,7 +1380,17 @@ void bio_check_pages_dirty(struct bio *bio)
  **/
 void bio_endio(struct bio *bio, int error)
 {
-	if (error)
+	/*
+	 * Special case here - hide the -EOPNOTSUPP from the driver or
+	 * block layer, dump a warning the first time this happens so that
+	 * the admin knows that we may not provide the ordering guarantees
+	 * that are needed. Don't clear the uptodate bit.
+	 */
+	if (error == -EOPNOTSUPP && bio_barrier(bio)) {
+		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+		blk_queue_set_noflush(bio->bi_bdev);
+		error = 0;
+	} else if (error)
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ebe6b29..d696d26 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1834,7 +1834,6 @@ extent_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 static int submit_one_bio(int rw, struct bio *bio, int mirror_num,
 			  unsigned long bio_flags)
 {
-	int ret = 0;
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 	struct page *page = bvec->bv_page;
 	struct extent_io_tree *tree = bio->bi_private;
@@ -1846,17 +1845,13 @@ static int submit_one_bio(int rw, struct bio *bio, int mirror_num,
 
 	bio->bi_private = NULL;
 
-	bio_get(bio);
-
 	if (tree->ops && tree->ops->submit_bio_hook)
 		tree->ops->submit_bio_hook(page->mapping->host, rw, bio,
 					   mirror_num, bio_flags);
 	else
 		submit_bio(rw, bio);
-	if (bio_flagged(bio, BIO_EOPNOTSUPP))
-		ret = -EOPNOTSUPP;
-	bio_put(bio);
-	return ret;
+
+	return 0;
 }
 
 static int submit_extent_page(int rw, struct extent_io_tree *tree,
diff --git a/fs/buffer.c b/fs/buffer.c
index a2fd743..6f50e08 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -147,17 +147,9 @@ void end_buffer_read_sync(struct buffer_head *bh, int uptodate)
 
 void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-	char b[BDEVNAME_SIZE];
-
 	if (uptodate) {
 		set_buffer_uptodate(bh);
 	} else {
-		if (!buffer_eopnotsupp(bh) && !quiet_error(bh)) {
-			buffer_io_error(bh);
-			printk(KERN_WARNING "lost page write due to "
-					"I/O error on %s\n",
-				       bdevname(bh->b_bdev, b));
-		}
 		set_buffer_write_io_error(bh);
 		clear_buffer_uptodate(bh);
 	}
@@ -2828,7 +2820,7 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
 
 	if (err == -EOPNOTSUPP) {
 		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		set_bit(BH_Eopnotsupp, &bh->b_state);
+		err = 0;
 	}
 
 	if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags)))
@@ -2841,7 +2833,6 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
 int submit_bh(int rw, struct buffer_head * bh)
 {
 	struct bio *bio;
-	int ret = 0;
 
 	BUG_ON(!buffer_locked(bh));
 	BUG_ON(!buffer_mapped(bh));
@@ -2879,14 +2870,8 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
 
-	bio_get(bio);
 	submit_bio(rw, bio);
-
-	if (bio_flagged(bio, BIO_EOPNOTSUPP))
-		ret = -EOPNOTSUPP;
-
-	bio_put(bio);
-	return ret;
+	return 0;
 }
 
 /**
@@ -2965,10 +2950,6 @@ int sync_dirty_buffer(struct buffer_head *bh)
 		bh->b_end_io = end_buffer_write_sync;
 		ret = submit_bh(WRITE, bh);
 		wait_on_buffer(bh);
-		if (buffer_eopnotsupp(bh)) {
-			clear_buffer_eopnotsupp(bh);
-			ret = -EOPNOTSUPP;
-		}
 		if (!ret && !buffer_uptodate(bh))
 			ret = -EIO;
 	} else {
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index de3a198..c0cacb2 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -406,7 +406,6 @@ xfs_submit_ioend_bio(
 	bio->bi_end_io = xfs_end_bio;
 
 	submit_bio(WRITE, bio);
-	ASSERT(!bio_flagged(bio, BIO_EOPNOTSUPP));
 	bio_put(bio);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 465d6ba..ea2e15a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -452,6 +452,7 @@ struct request_queue
 #define QUEUE_FLAG_NONROT      14	/* non-rotational device (SSD) */
 #define QUEUE_FLAG_VIRT        QUEUE_FLAG_NONROT /* paravirt device */
 #define QUEUE_FLAG_IO_STAT     15	/* do IO stats */
+#define QUEUE_FLAG_NOFLUSH     16	/* device doesn't do flushes */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_CLUSTER) |		\
@@ -789,6 +790,7 @@ extern int blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 extern void blk_unplug(struct request_queue *q);
+extern void blk_queue_set_noflush(struct block_device *bdev);
 
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index f19fd90..8adcaa4 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -33,7 +33,6 @@ enum bh_state_bits {
 	BH_Boundary,	/* Block is followed by a discontiguity */
 	BH_Write_EIO,	/* I/O error on write */
 	BH_Ordered,	/* ordered write */
-	BH_Eopnotsupp,	/* operation not supported (barrier) */
 	BH_Unwritten,	/* Buffer is allocated on disk but not written */
 	BH_Quiet,	/* Buffer Error Prinks to be quiet */
 
@@ -126,7 +125,6 @@ BUFFER_FNS(Delay, delay)
 BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
-BUFFER_FNS(Eopnotsupp, eopnotsupp)
 BUFFER_FNS(Unwritten, unwritten)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 16:43                                                                               ` Jens Axboe
@ 2009-03-31 16:57                                                                                 ` Linus Torvalds
  2009-03-31 17:19                                                                                   ` Jens Axboe
  2009-03-31 17:03                                                                                 ` Jens Axboe
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-31 16:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ric Wheeler, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj



On Tue, 31 Mar 2009, Jens Axboe wrote:
> 
> So here's a test patch that attempts to just ignore such a failure to
> flush the caches.

I suspect you should not do it like this.

> diff --git a/fs/bio.c b/fs/bio.c
> index a040cde..79e3cec 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -1380,7 +1380,17 @@ void bio_check_pages_dirty(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio, int error)
>  {
> -	if (error)
> +	/*
> +	 * Special case here - hide the -EOPNOTSUPP from the driver or
> +	 * block layer, dump a warning the first time this happens so that
> +	 * the admin knows that we may not provide the ordering guarantees
> +	 * that are needed. Don't clear the uptodate bit.
> +	 */
> +	if (error == -EOPNOTSUPP && bio_barrier(bio)) {
> +		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
> +		blk_queue_set_noflush(bio->bi_bdev);
> +		error = 0;
> +	} else if (error)

I suspect this part is just wrong.

I could easily imagine a driver that returns EOPNOTSUPP only for a certain 
_kind_ of bio.

For example, if the drive doesn't support FUA, then you cannot do a 
serialized IO operation, but you can still mostly do a serialized op 
without any IO attached to it.

IOW, the "empty flush" really _is_ special. And this check should not be in
the generic "bio_endio()" case, it should only be in the special 
blkdev_issue_flush() case.

I think. No?

		Linus
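
A minimal user-space sketch of the placement argued for above, assuming a simplified model rather than the kernel's real data structures: the generic completion routine reports every error unchanged, and only the flush-specific caller treats "barrier not supported" as success. All `model_*` names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a bio; not the kernel's struct bio. */
struct model_bio {
	bool barrier;      /* request carries barrier semantics */
	bool has_data;     /* false for an "empty flush" */
	bool uptodate;     /* cleared when the request fails */
	bool eopnotsupp;   /* the barrier was rejected as unsupported */
};

/* Generic completion: every error, including -EOPNOTSUPP, stays visible. */
static int model_bio_endio(struct model_bio *bio, int error)
{
	assert(bio != NULL);

	if (error == -EOPNOTSUPP)
		bio->eopnotsupp = true;
	if (error)
		bio->uptodate = false;
	return error;
}

/*
 * Flush-specific caller: only here is "barrier not supported" mapped to
 * success, and only for an empty barrier that carries no data.
 */
static int model_issue_flush(struct model_bio *bio, int device_error)
{
	int ret;

	assert(bio != NULL);

	bio->barrier = true;
	bio->has_data = false;
	bio->uptodate = true;
	bio->eopnotsupp = false;

	ret = model_bio_endio(bio, device_error);
	if (ret == -EOPNOTSUPP)
		return 0;        /* nothing was queued, so nothing is lost */
	return ret ? -EIO : 0;
}
```

With this split, a data-carrying barrier write that fails with -EOPNOTSUPP still reaches its submitter as an error, while an empty flush on a device without flush support completes quietly.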

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 16:43                                                                               ` Jens Axboe
  2009-03-31 16:57                                                                                 ` Linus Torvalds
@ 2009-03-31 17:03                                                                                 ` Jens Axboe
  2009-04-01  0:43                                                                                   ` Tejun Heo
  1 sibling, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-31 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

On Tue, Mar 31 2009, Jens Axboe wrote:
> On Tue, Mar 31 2009, Linus Torvalds wrote:
> > 
> > 
> > On Tue, 31 Mar 2009, Ric Wheeler wrote:
> > > 
> > > The question is really what we do when you have a storage device in your box
> > > with a volatile write cache that does support flush or fua or similar.
> > 
> > Ok. Then you are talking about a different case - not EOPNOTSUPP.
> 
> So here's a test patch that attempts to just ignore such a failure to
> flush the caches. It will still flag the bio as BIO_EOPNOTSUPP, but
> that's merely maintaining the information in case the caller does want
> to see if that barrier failed or not. It may not actually be useful, in
> which case we can just kill that flag.

Updated version, the previous missed most of the buffer_eopnotsupp()
checking. So this one also gets rid of the file system retry logic.
Thanks to gfs2 Steve for pointing out that I missed gfs2, made me
realize that I missed a lot more as well.
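
The "warn only the first time" behaviour in the patch below rests on an atomic test-and-set of a per-queue flag. A rough user-space sketch of that pattern, assuming C11 atomics in place of the kernel's test_and_set_bit() and a hypothetical `struct model_queue` that models nothing but the flag:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a request queue; only the one flag is modelled. */
struct model_queue {
	atomic_flag noflush;   /* set once the device has failed a barrier */
	const char *name;
};

/*
 * Returns true if this call printed the warning, false if the flag was
 * already set. atomic_flag_test_and_set() returns the previous value, so
 * exactly one caller observes "false", even with concurrent completions.
 */
static bool model_queue_set_noflush(struct model_queue *q)
{
	assert(q != NULL);

	if (atomic_flag_test_and_set(&q->noflush))
		return false;

	fprintf(stderr,
		"Device %s does not appear to honor cache flushes. "
		"This may mean that file system ordering guarantees "
		"are not met.\n", q->name);
	return true;
}
```

The boolean return value is an addition for the sake of the sketch; the function in the patch returns void.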


 block/blk-barrier.c         |    8 ++------
 block/blk-settings.c        |   13 +++++++++++++
 block/ioctl.c               |    4 +---
 fs/bio.c                    |   12 +++++++++++-
 fs/btrfs/disk-io.c          |    5 -----
 fs/btrfs/extent_io.c        |    9 ++-------
 fs/buffer.c                 |   23 ++---------------------
 fs/fat/misc.c               |    5 +----
 fs/gfs2/log.c               |   18 ++++++------------
 fs/jbd2/commit.c            |   22 ----------------------
 fs/reiserfs/journal.c       |   15 ---------------
 fs/xfs/linux-2.6/xfs_aops.c |    1 -
 include/linux/blkdev.h      |    2 ++
 include/linux/buffer_head.h |    2 --
 14 files changed, 40 insertions(+), 99 deletions(-)

commit 74e725b7f2e5f3f073abe84c5823026a6f1e33ce

---

Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Tue Mar 31 19:00:53 2009 +0200

    barrier: Don't return -EOPNOTSUPP to the caller if the device does not support barriers
    
    The caller cannot really do much about the situation anyway. Instead log
    a warning if this is the first such failed barrier we see, so that the
    admin can look into whether this poses a data integrity problem or not.
    
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f7dae57..8660146 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -338,9 +338,7 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
 		*error_sector = bio->bi_sector;
 
 	ret = 0;
-	if (bio_flagged(bio, BIO_EOPNOTSUPP))
-		ret = -EOPNOTSUPP;
-	else if (!bio_flagged(bio, BIO_UPTODATE))
+	if (!bio_flagged(bio, BIO_UPTODATE))
 		ret = -EIO;
 
 	bio_put(bio);
@@ -408,9 +406,7 @@ int blkdev_issue_discard(struct block_device *bdev,
 		submit_bio(DISCARD_BARRIER, bio);
 
 		/* Check if it failed immediately */
-		if (bio_flagged(bio, BIO_EOPNOTSUPP))
-			ret = -EOPNOTSUPP;
-		else if (!bio_flagged(bio, BIO_UPTODATE))
+		if (!bio_flagged(bio, BIO_UPTODATE))
 			ret = -EIO;
 		bio_put(bio);
 	}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 59fd05d..0a81466 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -463,6 +463,19 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
 }
 EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 
+void blk_queue_set_noflush(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+	char b[BDEVNAME_SIZE];
+
+	if (test_and_set_bit(QUEUE_FLAG_NOFLUSH, &q->queue_flags))
+		return;
+
+	printk(KERN_ERR "Device %s does not appear to honor cache flushes. "
+			"This may mean that file system ordering guarantees "
+			"are not met.\n", bdevname(bdev, b));
+}
+
 static int __init blk_settings_init(void)
 {
 	blk_max_low_pfn = max_low_pfn - 1;
diff --git a/block/ioctl.c b/block/ioctl.c
index 0f22e62..769f7be 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -166,9 +166,7 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 
 		wait_for_completion(&wait);
 
-		if (bio_flagged(bio, BIO_EOPNOTSUPP))
-			ret = -EOPNOTSUPP;
-		else if (!bio_flagged(bio, BIO_UPTODATE))
+		if (!bio_flagged(bio, BIO_UPTODATE))
 			ret = -EIO;
 		bio_put(bio);
 	}
diff --git a/fs/bio.c b/fs/bio.c
index a040cde..79e3cec 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1380,7 +1380,17 @@ void bio_check_pages_dirty(struct bio *bio)
  **/
 void bio_endio(struct bio *bio, int error)
 {
-	if (error)
+	/*
+	 * Special case here - hide the -EOPNOTSUPP from the driver or
+	 * block layer, dump a warning the first time this happens so that
+	 * the admin knows that we may not provide the ordering guarantees
+	 * that are needed. Don't clear the uptodate bit.
+	 */
+	if (error == -EOPNOTSUPP && bio_barrier(bio)) {
+		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+		blk_queue_set_noflush(bio->bi_bdev);
+		error = 0;
+	} else if (error)
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6ec80c0..fd3ea97 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1951,11 +1951,6 @@ static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 	if (uptodate) {
 		set_buffer_uptodate(bh);
 	} else {
-		if (!buffer_eopnotsupp(bh) && printk_ratelimit()) {
-			printk(KERN_WARNING "lost page write due to "
-					"I/O error on %s\n",
-				       bdevname(bh->b_bdev, b));
-		}
 		/* note, we dont' set_buffer_write_io_error because we have
 		 * our own ways of dealing with the IO errors
 		 */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ebe6b29..d696d26 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1834,7 +1834,6 @@ extent_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 static int submit_one_bio(int rw, struct bio *bio, int mirror_num,
 			  unsigned long bio_flags)
 {
-	int ret = 0;
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 	struct page *page = bvec->bv_page;
 	struct extent_io_tree *tree = bio->bi_private;
@@ -1846,17 +1845,13 @@ static int submit_one_bio(int rw, struct bio *bio, int mirror_num,
 
 	bio->bi_private = NULL;
 
-	bio_get(bio);
-
 	if (tree->ops && tree->ops->submit_bio_hook)
 		tree->ops->submit_bio_hook(page->mapping->host, rw, bio,
 					   mirror_num, bio_flags);
 	else
 		submit_bio(rw, bio);
-	if (bio_flagged(bio, BIO_EOPNOTSUPP))
-		ret = -EOPNOTSUPP;
-	bio_put(bio);
-	return ret;
+
+	return 0;
 }
 
 static int submit_extent_page(int rw, struct extent_io_tree *tree,
diff --git a/fs/buffer.c b/fs/buffer.c
index a2fd743..6f50e08 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -147,17 +147,9 @@ void end_buffer_read_sync(struct buffer_head *bh, int uptodate)
 
 void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-	char b[BDEVNAME_SIZE];
-
 	if (uptodate) {
 		set_buffer_uptodate(bh);
 	} else {
-		if (!buffer_eopnotsupp(bh) && !quiet_error(bh)) {
-			buffer_io_error(bh);
-			printk(KERN_WARNING "lost page write due to "
-					"I/O error on %s\n",
-				       bdevname(bh->b_bdev, b));
-		}
 		set_buffer_write_io_error(bh);
 		clear_buffer_uptodate(bh);
 	}
@@ -2828,7 +2820,7 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
 
 	if (err == -EOPNOTSUPP) {
 		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		set_bit(BH_Eopnotsupp, &bh->b_state);
+		err = 0;
 	}
 
 	if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags)))
@@ -2841,7 +2833,6 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
 int submit_bh(int rw, struct buffer_head * bh)
 {
 	struct bio *bio;
-	int ret = 0;
 
 	BUG_ON(!buffer_locked(bh));
 	BUG_ON(!buffer_mapped(bh));
@@ -2879,14 +2870,8 @@ int submit_bh(int rw, struct buffer_head * bh)
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
 
-	bio_get(bio);
 	submit_bio(rw, bio);
-
-	if (bio_flagged(bio, BIO_EOPNOTSUPP))
-		ret = -EOPNOTSUPP;
-
-	bio_put(bio);
-	return ret;
+	return 0;
 }
 
 /**
@@ -2965,10 +2950,6 @@ int sync_dirty_buffer(struct buffer_head *bh)
 		bh->b_end_io = end_buffer_write_sync;
 		ret = submit_bh(WRITE, bh);
 		wait_on_buffer(bh);
-		if (buffer_eopnotsupp(bh)) {
-			clear_buffer_eopnotsupp(bh);
-			ret = -EOPNOTSUPP;
-		}
 		if (!ret && !buffer_uptodate(bh))
 			ret = -EIO;
 	} else {
diff --git a/fs/fat/misc.c b/fs/fat/misc.c
index ac39ebc..406da54 100644
--- a/fs/fat/misc.c
+++ b/fs/fat/misc.c
@@ -269,10 +269,7 @@ int fat_sync_bhs(struct buffer_head **bhs, int nr_bhs)
 	ll_rw_block(SWRITE, nr_bhs, bhs);
 	for (i = 0; i < nr_bhs; i++) {
 		wait_on_buffer(bhs[i]);
-		if (buffer_eopnotsupp(bhs[i])) {
-			clear_buffer_eopnotsupp(bhs[i]);
-			err = -EOPNOTSUPP;
-		} else if (!err && !buffer_uptodate(bhs[i]))
+		if (!err && !buffer_uptodate(bhs[i]))
 			err = -EIO;
 	}
 	return err;
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 98918a7..78c3d59 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -578,6 +578,7 @@ static void log_write_header(struct gfs2_sbd *sdp, u32 flags, int pull)
 	struct gfs2_log_header *lh;
 	unsigned int tail;
 	u32 hash;
+	int rw;
 
 	bh = sb_getblk(sdp->sd_vfs, blkno);
 	lock_buffer(bh);
@@ -602,20 +603,13 @@ static void log_write_header(struct gfs2_sbd *sdp, u32 flags, int pull)
 
 	bh->b_end_io = end_buffer_write_sync;
 	if (test_bit(SDF_NOBARRIERS, &sdp->sd_flags))
-		goto skip_barrier;
+		rw = WRITE;
+	else
+		rw = WRITE_BARRIER;
+
 	get_bh(bh);
-	submit_bh(WRITE_BARRIER | (1 << BIO_RW_META), bh);
+	submit_bh(rw | (1 << BIO_RW_META), bh);
 	wait_on_buffer(bh);
-	if (buffer_eopnotsupp(bh)) {
-		clear_buffer_eopnotsupp(bh);
-		set_buffer_uptodate(bh);
-		set_bit(SDF_NOBARRIERS, &sdp->sd_flags);
-		lock_buffer(bh);
-skip_barrier:
-		get_bh(bh);
-		submit_bh(WRITE_SYNC | (1 << BIO_RW_META), bh);
-		wait_on_buffer(bh);
-	}
 	if (!buffer_uptodate(bh))
 		gfs2_io_error_bh(sdp, bh);
 	brelse(bh);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 62804e5..c1de70c 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -174,30 +174,8 @@ static int journal_wait_on_commit_record(journal_t *journal,
 {
 	int ret = 0;
 
-retry:
 	clear_buffer_dirty(bh);
 	wait_on_buffer(bh);
-	if (buffer_eopnotsupp(bh) && (journal->j_flags & JBD2_BARRIER)) {
-		printk(KERN_WARNING
-		       "JBD2: wait_on_commit_record: sync failed on %s - "
-		       "disabling barriers\n", journal->j_devname);
-		spin_lock(&journal->j_state_lock);
-		journal->j_flags &= ~JBD2_BARRIER;
-		spin_unlock(&journal->j_state_lock);
-
-		lock_buffer(bh);
-		clear_buffer_dirty(bh);
-		set_buffer_uptodate(bh);
-		bh->b_end_io = journal_end_buffer_io_sync;
-
-		ret = submit_bh(WRITE_SYNC, bh);
-		if (ret) {
-			unlock_buffer(bh);
-			return ret;
-		}
-		goto retry;
-	}
-
 	if (unlikely(!buffer_uptodate(bh)))
 		ret = -EIO;
 	put_bh(bh);            /* One for getblk() */
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 77f5bb7..5eefa6c 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -700,18 +700,6 @@ static int submit_barrier_buffer(struct buffer_head *bh)
 	return submit_bh(WRITE_BARRIER, bh);
 }
 
-static void check_barrier_completion(struct super_block *s,
-				     struct buffer_head *bh)
-{
-	if (buffer_eopnotsupp(bh)) {
-		clear_buffer_eopnotsupp(bh);
-		disable_barrier(s);
-		set_buffer_uptodate(bh);
-		set_buffer_dirty(bh);
-		sync_dirty_buffer(bh);
-	}
-}
-
 #define CHUNK_SIZE 32
 struct buffer_chunk {
 	struct buffer_head *bh[CHUNK_SIZE];
@@ -1148,8 +1136,6 @@ static int flush_commit_list(struct super_block *s,
 	} else
 		wait_on_buffer(jl->j_commit_bh);
 
-	check_barrier_completion(s, jl->j_commit_bh);
-
 	/* If there was a write error in the journal - we can't commit this
 	 * transaction - it will be invalid and, if successful, will just end
 	 * up propagating the write error out to the filesystem. */
@@ -1313,7 +1299,6 @@ static int _update_journal_header_block(struct super_block *sb,
 				goto sync;
 			}
 			wait_on_buffer(journal->j_header_bh);
-			check_barrier_completion(sb, journal->j_header_bh);
 		} else {
 		      sync:
 			set_buffer_dirty(journal->j_header_bh);
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index de3a198..c0cacb2 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -406,7 +406,6 @@ xfs_submit_ioend_bio(
 	bio->bi_end_io = xfs_end_bio;
 
 	submit_bio(WRITE, bio);
-	ASSERT(!bio_flagged(bio, BIO_EOPNOTSUPP));
 	bio_put(bio);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 465d6ba..ea2e15a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -452,6 +452,7 @@ struct request_queue
 #define QUEUE_FLAG_NONROT      14	/* non-rotational device (SSD) */
 #define QUEUE_FLAG_VIRT        QUEUE_FLAG_NONROT /* paravirt device */
 #define QUEUE_FLAG_IO_STAT     15	/* do IO stats */
+#define QUEUE_FLAG_NOFLUSH     16	/* device doesn't do flushes */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_CLUSTER) |		\
@@ -789,6 +790,7 @@ extern int blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 extern void blk_unplug(struct request_queue *q);
+extern void blk_queue_set_noflush(struct block_device *bdev);
 
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index f19fd90..8adcaa4 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -33,7 +33,6 @@ enum bh_state_bits {
 	BH_Boundary,	/* Block is followed by a discontiguity */
 	BH_Write_EIO,	/* I/O error on write */
 	BH_Ordered,	/* ordered write */
-	BH_Eopnotsupp,	/* operation not supported (barrier) */
 	BH_Unwritten,	/* Buffer is allocated on disk but not written */
 	BH_Quiet,	/* Buffer Error Prinks to be quiet */
 
@@ -126,7 +125,6 @@ BUFFER_FNS(Delay, delay)
 BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
-BUFFER_FNS(Eopnotsupp, eopnotsupp)
 BUFFER_FNS(Unwritten, unwritten)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)


-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 16:15                                                                             ` Linus Torvalds
  2009-03-31 16:43                                                                               ` Jens Axboe
@ 2009-03-31 17:14                                                                               ` Ric Wheeler
  1 sibling, 0 replies; 664+ messages in thread
From: Ric Wheeler @ 2009-03-31 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

Linus Torvalds wrote:
> 
> On Tue, 31 Mar 2009, Ric Wheeler wrote:
>> The question is really what we do when you have a storage device in your box
>> with a volatile write cache that does support flush or fua or similar.
> 
> Ok. Then you are talking about a different case - not EOPNOTSUPP.
> 
> [ Although it may be related in that maybe the admin can _force_ a 
>   EOPNOTSUPP thing for when he wants to disable any "write barrier implies 
>   flush" thing.
> 
>   IOW, we may end up with an _implementation_ detail where we overload a 
>   potential QUEUE_FLUSH_EOPNOTSUPP flag with two meanings - either "the 
>   driver told me a barrier isn't supported" or "the admin set that same 
>   flag by hand to disable barrier-related flush commands".
> 
>   But that's just an implementation detail, of course. We could use two 
>   different flags, we could do the flags at different levels, whatever. ]
> 
>> Using barriers & ordered transactions for these types of devices will 
>> give you a more reliable file system - less fsck time needed and better 
>> data integrity support for the (few?) applications that use fsync 
>> properly.
> 
> Sure. And it still shouldn't be the filesystem that _requires_ use of it.

That sounds reasonable enough. The key thing is how to squeeze as much 
reliability as possible out of whatever you have at the time.
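
As a rough illustration of the "two different flags" option mentioned in the quoted text, here is a small user-space sketch. The flag names are hypothetical (the thread only floats a potential QUEUE_FLUSH_EOPNOTSUPP), and the point is merely that the decision to emit a flush lives below the filesystem:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag names, invented for this sketch. */
enum model_queue_flags {
	MODEL_FLUSH_UNSUPPORTED = 1u << 0, /* driver said barriers are unsupported */
	MODEL_FLUSH_DISABLED    = 1u << 1, /* admin turned flushes off by hand */
};

struct model_queue {
	unsigned int flags;
};

/*
 * The filesystem always asks for ordering; whether that request turns
 * into a cache flush command is decided here, from queue state alone.
 */
static bool model_should_send_flush(const struct model_queue *q)
{
	assert(q != NULL);
	return !(q->flags & (MODEL_FLUSH_UNSUPPORTED | MODEL_FLUSH_DISABLED));
}
```

Keeping the two reasons in separate bits means a log message or sysfs attribute could still tell "the device cannot" apart from "the admin chose not to", which a single overloaded flag would lose.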

> 
> The user (or low-level driver) may simply know better. The user may 
> know that he trusts the disk more than anything else, and prefers to 
> not actually emit the "FLUSH" command. Again, that's not something that 
> the filesystem should know about, or care about. If the user trusts the 
> disk subsystem and wants the performance, it's the users choice.
> 
> Even the _driver_ may know better.

True - high end arrays (as you mention below) will probably ack a flush request 
without flushing data, basically turning them into no-ops.

> Knowing the kinds of firmware bugs those drives have, it could even be a 
> driver that simply black-lists certain disks as having known-broken FLUSH 
> commands. We have _CPU's_ that corrupt memory on cache writeback 
> ("wbinvd"), and those things are a lot more tested than most driver 
> firmware is.
> 
> Do you realize just how buggy some of those flash drives are? Some of them 
> will literally (a) report the wrong size and (b) lock up if you try to 
> read from the last sector. Oops. Do you really expect such crap to 
> even bother to honor some flush command? Good luck with that. They're 
> designed as a floppy replacement.

Sure - really cheap & crappy storage is easy enough to find. Definitely I agree 
that flush barriers would be wasted on them.

> 
> Now, you can tell me that I shouldn't put a reliable filesystem on an 
> el-cheapo flash drive and expect it to work, but I'm sorry, you're wrong. 
> People _are_ supposed to be able to move their data around, and the 
> filesystem shouldn't make judgement calls. If you want judgement calls, 
> call your mom. Not your filesystem.

File systems should try to do their best job with what they have, but we might 
also want to use a non-transaction based file system (ext2? ext4 w/o the journal 
like google?). Again, as you suggest, users (or distro installers?) can make 
that kind of choice.

> For another example, the driver might be a driver for a high-end 
> battery-backup SCSI RAID controller. It knows that the controller _will_ 
> write things out in the right order even in the case of a crash, but it 
> may also know that the controller _also_ has a way to force a flush to 
> actual hardware.
> 
> When do you want to force a flush? For hotplug events, for example. Maybe 
> the disks won't be _connected_ any more afterwards - then the battery 
> backup on the controller won't be helping, will it? So there may well be a 
> flush event thing, but it's really up to the admin to decide whether it 
> should be connected to a write barrier thing, or be a separate admin 
> activity.

For non-volatile write caches like these, you don't need to "flush" the storage 
write cache, you just need to move the data to the storage in the correct order.

As far as I know, none of this kind of information is exposed to higher levels in 
a standard way, so what people do today is to disable barriers (or assume, 
correctly as far as I know, that arrays will drop the flush requests :-))


> 
> Maybe the admin is extra careful and anal, and decides that he wants to 
> flush to disk platters _despite_ the battery backup. Maybe he doesn't 
> trust the card. Maybe he does.  Whatever. The point is that the admin 
> might want to set a driver flag that does the flush or not, and it's 
> totally not a filesystem issue.
> 
> See? The filesystem has absolutely _no_place_ deciding these kinds of 
> things. The only thing it can ask for is "please serialize", but what 
> _level_ of serialization is simply not a filesystem decision to make.
> 
> And that very much includes the level of serialization that says "no 
> serialization what-so-ever, and please go absolutely crazy with your 
> cache". Not your choice.
> 
> So no, you can't have a pony.
> 
> 			Linus


No room for a pony in my yard in any case :-)

ric



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 16:57                                                                                 ` Linus Torvalds
@ 2009-03-31 17:19                                                                                   ` Jens Axboe
  2009-04-01  0:54                                                                                     ` Tejun Heo
  0 siblings, 1 reply; 664+ messages in thread
From: Jens Axboe @ 2009-03-31 17:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Fernando Luis Vázquez Cao, Jeff Garzik,
	Christoph Hellwig, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	david, tj

On Tue, Mar 31 2009, Linus Torvalds wrote:
> 
> 
> On Tue, 31 Mar 2009, Jens Axboe wrote:
> > 
> > So here's a test patch that attempts to just ignore such a failure to
> > flush the caches.
> 
> I suspect you should not do it like this.
> 
> > diff --git a/fs/bio.c b/fs/bio.c
> > index a040cde..79e3cec 100644
> > --- a/fs/bio.c
> > +++ b/fs/bio.c
> > @@ -1380,7 +1380,17 @@ void bio_check_pages_dirty(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio, int error)
> >  {
> > -	if (error)
> > +	/*
> > +	 * Special case here - hide the -EOPNOTSUPP from the driver or
> > +	 * block layer, dump a warning the first time this happens so that
> > +	 * the admin knows that we may not provide the ordering guarantees
> > +	 * that are needed. Don't clear the uptodate bit.
> > +	 */
> > +	if (error == -EOPNOTSUPP && bio_barrier(bio)) {
> > +		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
> > +		blk_queue_set_noflush(bio->bi_bdev);
> > +		error = 0;
> > +	} else if (error)
> 
> I suspect this part is just wrong.
> 
> I could easily imagine a driver that returns EOPNOTSUPP only for a certain 
> _kind_ of bio.
> 
> For example, if the drive doesn't support FUA, then you cannot do a 
> serialized IO operation, but you can still mostly do a serialized op 
> without any IO attached to it.

FUA we should be able to reliably detect, it's really the cache flush
operation itself that has caused headaches in the past. The -EOPNOTSUPP
really comes from the block layer, not from the device driver. That's
mainly due to the fact that we only send down the actual barrier, if the
driver already said it supported them. If they do fail them, we probably
need to pick up the -EIO bits and pieces and pretend it didn't happen as
well. So it definitely needs more looking into, auditing, and testing.
I'll do that tomorrow.
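
To make the "comes from the block layer, not from the device driver" point concrete, here is a small user-space sketch, assuming a much simplified submission path. The names are hypothetical and only loosely modelled on the kernel's ordered-mode setting; the idea is that a barrier is rejected before the driver ever sees it when the queue never advertised barrier support:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ordering modes, loosely after the kernel's QUEUE_ORDERED_*. */
enum model_ordered { MODEL_ORDERED_NONE, MODEL_ORDERED_DRAIN_FLUSH };

struct model_queue {
	enum model_ordered ordered;          /* what the driver advertised */
	int (*driver_submit)(bool barrier);  /* stand-in for the driver path */
};

/* Example driver that accepts everything and counts how often it is called. */
static int model_driver_calls;

static int model_driver_ok(bool barrier)
{
	(void)barrier;
	model_driver_calls++;
	return 0;
}

static int model_submit(struct model_queue *q, bool barrier)
{
	assert(q != NULL && q->driver_submit != NULL);

	/* Barrier on a queue that never advertised support: fail it here. */
	if (barrier && q->ordered == MODEL_ORDERED_NONE)
		return -EOPNOTSUPP;

	return q->driver_submit(barrier);
}
```

In this model the driver is reached only for plain writes or for barriers it has claimed to support, so any failure it reports for a barrier would surface as an ordinary I/O error rather than as -EOPNOTSUPP.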

> IOW, the "empty flush" really _is_ special. And this check should not be in 
> the generic "bio_endio()" case, it should only be in the special 
> blkdev_issue_flush() case.
> 
> I think. No?

The empty flush is special and it is easy to fix that by itself. That
should probably be the first patch in the series. But the retry logic
and such for actual write barriers are the majority of the problems
involved with supporting barriers, and those I want to get rid of.

I think it'll be more clear when I post a real patch series with the
individual steps outlined.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29  1:18                                                                             ` Jeff Garzik
@ 2009-03-31 18:45                                                                               ` Jörn Engel
  0 siblings, 0 replies; 664+ messages in thread
From: Jörn Engel @ 2009-03-31 18:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mark Lord, Stefan Richter, Linus Torvalds, Matthew Garrett,
	Alan Cox, Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Sat, 28 March 2009 21:18:51 -0400, Jeff Garzik wrote:
> 
> Was the BSD soft-updates idea of FS data-before-metadata a good one? 
> Yes.  Obviously.
> 
> It is the cornerstone of every SANE journalling-esque database or 
> filesystem out there -- don't leave a window where your metadata is 
> inconsistent.  "Duh" :)

Your idea of 'consistent' seems a bit fuzzy.  Soft updates, afaiu, leave
plenty of windows and reasons to run fsck.  They only guarantee that all
those windows result in lost space - data allocations without any
references.  It certainly prevents the worst problems, but I would use a
different word for it. :)

Jörn

-- 
Don't worry about people stealing your ideas. If your ideas are any good,
you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by
   Raph Levien, 1979

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 15:41                                                                           ` Ric Wheeler
  2009-03-31 16:15                                                                             ` Linus Torvalds
@ 2009-03-31 19:25                                                                             ` Mark Lord
  1 sibling, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-03-31 19:25 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Linus Torvalds, Jens Axboe, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david, tj

Ric Wheeler wrote:
..
> As Mark pointed out, most S-ATA/SAS drives will flush the write cache 
> when they see a bus reset so even without barriers, the cache will be 
> preserved (or flushed) after a reboot or panic.  Power outages are the 
> problem barriers/flushes are meant to help with.
..

I still see barriers as a separate issue from flushes.
Flushes are there for power failures and hot-removable devices.

Barriers are there for that, but also for better odds of data integrity
in the event of a filesystem or kernel crash.

Even if I don't want the kernel needlessly flushing my battery-backed
write caches, I still do want the barrier ordering that improves the
odds of filesystem consistency in the event of a kernel crash.

Cheers

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 22:07           ` Arjan van de Ven
                               ` (2 preceding siblings ...)
  2009-03-31 15:30             ` Hans-Peter Jansen
@ 2009-03-31 19:37             ` Jeff Garzik
  2009-03-31 19:47               ` Arjan van de Ven
  3 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-31 19:37 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Hans-Peter Jansen, Linus Torvalds, Mike Galbraith,
	Geert Uytterhoeven, linux-kernel

Arjan van de Ven wrote:
> Hans-Peter Jansen wrote:
>> On Friday, 27 March 2009, Linus Torvalds wrote:
>>
>> I always wonder, why Arjan does not intervene for his kerneloops.org 
>> project, since your approach opens a window of uncertainty during the 
>> merge window when simply using git as an efficient fetch tool.
> 
> I would *love* it if Linus would, as first commit mark his tree as "-git0"
> (as per snapshots) or "-rc0". So that I can split the "final" versus
> "merge window" oopses.

Can't you discern that from the v$VERSION tag?  According to your 
definition, -git0 would simply be v2.6.29 commit + 1, correct?

	Jeff





^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-31 19:37             ` Jeff Garzik
@ 2009-03-31 19:47               ` Arjan van de Ven
  0 siblings, 0 replies; 664+ messages in thread
From: Arjan van de Ven @ 2009-03-31 19:47 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Hans-Peter Jansen, Linus Torvalds, Mike Galbraith,
	Geert Uytterhoeven, linux-kernel

Jeff Garzik wrote:
> Arjan van de Ven wrote:
>> Hans-Peter Jansen wrote:
>>> On Friday, 27 March 2009, Linus Torvalds wrote:
>>>
>>> I always wonder, why Arjan does not intervene for his kerneloops.org 
>>> project, since your approach opens a window of uncertainty during the 
>>> merge window when simply using git as an efficient fetch tool.
>>
>> I would *love* it if Linus would, as first commit mark his tree as 
>> "-git0"
>> (as per snapshots) or "-rc0". So that I can split the "final" versus
>> "merge window" oopses.
> 
> Can't you discern that from the v$VERSION tag?  According to your 
> definition, -git0 would simply be v2.6.29 commit + 1, correct?

it needs to be something that is shown in the oops output...
... basically version or extraversion in the Makefile.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-25 19:43                           ` Jens Axboe
  2009-03-25 19:49                             ` Ric Wheeler
  2009-03-25 20:25                             ` Jeff Garzik
@ 2009-03-31 20:49                             ` Jeff Garzik
  2009-03-31 22:02                               ` Ric Wheeler
  2 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-31 20:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Jens Axboe wrote:
> Another problem is that FLUSH_CACHE sucks. Really. And not just on
> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
> wait for the world to finish. Pretty hard to teach people to use a nicer
> fdatasync(), when the majority of the cost now becomes flushing the
> cache of that 1TB drive you happen to have 8 partitions on. Good luck
> with that.

(responding to an email way back near the start of the thread)

I emailed Microsoft about their proposal to add a WRITE BARRIER command 
to ATA, documented at
http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command_Proposal.doc

The MSFT engineer said they were definitely still pursuing this proposal.

IMO we could look at this too, or perhaps come up with an alternate 
proposal like FLUSH CACHE RANGE(s).

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 16:34                                                                                             ` Linus Torvalds
  2009-03-30 17:11                                                                                               ` Ric Wheeler
@ 2009-03-31 21:10                                                                                               ` Alan Cox
  2009-03-31 21:55                                                                                                 ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Alan Cox @ 2009-03-31 21:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Andreas T.Auer, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

> percentage. For many setups, the other corruption issues (drive failure) 
> are not just more common, but generally more disastrous anyway. So why 
> would a person like that worry about the (rare) power failure?

How about the far more regular crash case ? We may be pretty reliable but
we are hardly indestructible especially on random boxes with funky BIOSes
or low grade hardware builds.

For the generic sane low end server/high end desktop build with at least
two drive software RAID the hardware failure for data loss case is
pretty rare. Crashes yes, having to reboot to recover from a RAID failure
sure but data loss far less so

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 19:22                                                                                                   ` Rik van Riel
                                                                                                                       ` (2 preceding siblings ...)
  2009-03-31  9:27                                                                                                     ` Neil Brown
@ 2009-03-31 21:13                                                                                                     ` Alan Cox
  3 siblings, 0 replies; 664+ messages in thread
From: Alan Cox @ 2009-03-31 21:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Ric Wheeler, Andreas T.Auer, Theodore Tso,
	Mark Lord, Stefan Richter, Jeff Garzik, Matthew Garrett,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

> No argument there.  I have seen NCQ starvation on SATA disks,
> with some requests sitting in the drive for seconds, while
> the drive was busy handling hundreds of requests/second
> elsewhere...

The really sad thing about that one is that the SCSI vendors had this
problem over ten years ago with TCQ - and fixed it in the drives.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-31 11:18                                                             ` Jens Axboe
@ 2009-03-31 21:29                                                               ` Jeff Garzik
  2009-04-01  1:03                                                                 ` Tejun Heo
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-31 21:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Theodore Tso, Chris Mason,
	Fernando Luis Vázquez Cao, Christoph Hellwig,
	Linus Torvalds, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, david

Jens Axboe wrote:
> On Tue, Mar 31 2009, Tejun Heo wrote:
>> Hello,
>>
>> Theodore Tso wrote:
>>> On Mon, Mar 30, 2009 at 10:15:51AM -0400, Chris Mason wrote:
>>>> I'm not sure we want to stick Fernando with changing how barriers are
>>>> done in individual filesystems, his patch is just changing the existing
>>>> call points.
>>> Well, his patch actually added some calls to block_issue_flush().  But
>>> yes, it's probably better if he just changes the existing call points,
>>> and we can have the relevant filesystem maintainers double check to
>>> make sure that there aren't any new call points which are needed.
>> How about having something like blk_ensure_cache_flushed() which
>> issues flush iff there hasn't been any write since the last flush?
>> It'll be easy to implement and will filter out duplicate flushes in
>> most cases.
> 
> My original ide implementation of flushes actually did this. My memory
> is a little hazy on why it was dropped, I'm guessing because it
> basically never triggered anyway.

Yeah, and it probably wouldn't trigger today unless we add new code that 
starts generating enough duplicate cache flushes for this to be 
significant...

And since duplicate cache flushes are harmless to the drive, you're only 
talking about no-op ATA command overhead.  Which is only mildly notable 
on legacy IDE (eight or so inb/outb operations).

I would put duplicate cache flush filtering way, way down on the 
priority list, IMO.
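
For anyone who does want to experiment with it, the filter itself is small.
What follows is a minimal single-threaded sketch, assuming the intent of the
proposal is to skip the flush when no write has completed since the last
one. struct cache_state, note_write(), issue_flush() and
ensure_cache_flushed() are made-up names for illustration, not block-layer
API, and a real version would need locking:

```c
#include <stdbool.h>

/* Hypothetical per-device state; not a real block-layer structure. */
struct cache_state {
	bool dirty;        /* a write has completed since the last flush */
	unsigned flushes;  /* flush commands actually sent (demo counter) */
};

/* Call on every completed write. */
void note_write(struct cache_state *cs)
{
	cs->dirty = true;
}

/* Stand-in for sending FLUSH CACHE to the drive; always succeeds here. */
static int issue_flush(struct cache_state *cs)
{
	cs->flushes++;
	return 0;
}

/* Send a flush only if something may be sitting in the write cache. */
int ensure_cache_flushed(struct cache_state *cs)
{
	int ret;

	if (!cs->dirty)
		return 0;	/* duplicate flush filtered out */

	/* Clear first so a write completing during the flush re-marks it. */
	cs->dirty = false;
	ret = issue_flush(cs);
	if (ret)
		cs->dirty = true;	/* flush failed, cache may still be dirty */
	return ret;
}
```

Two back-to-back calls with no write in between send a single flush, which
is the only case the filter would save.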

	Jeff





^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 21:55                                           ` Bojan Smojver
@ 2009-03-31 21:51                                             ` Jeremy Fitzhardinge
  2009-03-31 22:30                                               ` Bojan Smojver
  0 siblings, 1 reply; 664+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-31 21:51 UTC (permalink / raw)
  To: Bojan Smojver; +Cc: linux-kernel

Bojan Smojver wrote:
> Bojan Smojver <bojan <at> rexursive.com> writes:
>
>   
>> That was stupid. Ignore me.
>>     
>
> And yet, FreeBSD seems to have a command just like that:
>
> http://www.freebsd.org/cgi/man.cgi?query=fsync&sektion=1&manpath=FreeBSD+7.1-RELEASE
>   

I was thinking something like "munge_important_stuff | fsync > output" - 
i.e., cat which fsyncs on close.  In fact, it's vaguely surprising that GNU 
cat doesn't have this already.
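
The core of such a tool would be roughly the following. This is a sketch
only: copy_and_sync() is a made-up helper name, not taken from the FreeBSD
or GNU sources, and it assumes the shell has already opened "output" as
the program's stdout:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything from in_fd to out_fd, then fsync(out_fd).
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
int copy_and_sync(int in_fd, int out_fd)
{
	char buf[65536];
	ssize_t n;

	while ((n = read(in_fd, buf, sizeof(buf))) != 0) {
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		/* write() may be partial, so loop until the chunk is out. */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
	}
	/* fsync() fails with EINVAL on pipes and ttys; the caller decides. */
	return fsync(out_fd);
}
```

A main() that returns copy_and_sync(STDIN_FILENO, STDOUT_FILENO) != 0 would
then give the "munge_important_stuff | fsync > output" behaviour.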

    J

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-31 21:10                                                                                               ` Alan Cox
@ 2009-03-31 21:55                                                                                                 ` Linus Torvalds
  0 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-03-31 21:55 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ric Wheeler, Andreas T.Auer, Theodore Tso, Mark Lord,
	Stefan Richter, Jeff Garzik, Matthew Garrett, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List



On Tue, 31 Mar 2009, Alan Cox wrote:
> 
> How about the far more regular crash case ? We may be pretty reliable but
> we are hardly indestructible especially on random boxes with funky BIOSes
> or low grade hardware builds.

The regular crash case doesn't need to care about the disk write-cache AT 
ALL. The disk will finish the writes on its own long after the kernel 
crashed.

That was my _point_. The write cache on the disk is generally a whole lot 
safer than the OS data cache. If there's a catastrophic software failure 
(outside of the disk firmware itself ;), then the OS data cache is gone. 
But the disk write cache will be written back.

Of course, if you have an automatic and immediate "power-off-on-oops", 
you're screwed, but if so, you have bigger problems anyway. You need to 
wait at _least_ a second or two before you power off.

		Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-31 20:49                             ` Jeff Garzik
@ 2009-03-31 22:02                               ` Ric Wheeler
  2009-03-31 22:22                                 ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Ric Wheeler @ 2009-03-31 22:02 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jens Axboe, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Jeff Garzik wrote:
> Jens Axboe wrote:
>> Another problem is that FLUSH_CACHE sucks. Really. And not just on
>> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
>> wait for the world to finish. Pretty hard to teach people to use a nicer
>> fdatasync(), when the majority of the cost now becomes flushing the
>> cache of that 1TB drive you happen to have 8 partitions on. Good luck
>> with that.
>
> (responding to an email way back near the start of the thread)
>
> I emailed Microsoft about their proposal to add a WRITE BARRIER 
> command to ATA, documented at
> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command_Proposal.doc 
>
>
> The MSFT engineer said they were definitely still pursuing this proposal.
>
> IMO we could look at this too, or perhaps come up with an alternate 
> proposal like FLUSH CACHE RANGE(s).
>
>     Jeff
>

I agree that it is worth getting better mechanisms in place - the cache 
flush is really primitive. Now we just need a victim to sit in on 
T13/T10 standards meetings :-)

ric


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-31 22:02                               ` Ric Wheeler
@ 2009-03-31 22:22                                 ` Jeff Garzik
  2009-04-01 18:34                                   ` Mark Lord
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-03-31 22:22 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jens Axboe, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Mark Lord,
	Linux Kernel Mailing List, Linux IDE mailing list

Ric Wheeler wrote:
> Jeff Garzik wrote:
>> Jens Axboe wrote:
>>> Another problem is that FLUSH_CACHE sucks. Really. And not just on
>>> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
>>> wait for the world to finish. Pretty hard to teach people to use a nicer
>>> fdatasync(), when the majority of the cost now becomes flushing the
>>> cache of that 1TB drive you happen to have 8 partitions on. Good luck
>>> with that.
>>
>> (responding to an email way back near the start of the thread)
>>
>> I emailed Microsoft about their proposal to add a WRITE BARRIER 
>> command to ATA, documented at
>> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command_Proposal.doc 

>> The MSFT engineer said they were definitely still pursuing this proposal.
>>
>> IMO we could look at this too, or perhaps come up with an alternate 
>> proposal like FLUSH CACHE RANGE(s).

> I agree that it is worth getting better mechanisms in place - the cache 
> flush is really primitive. Now we just need a victim to sit in on 
> T13/T10 standards meetings :-)


Heck, we could even do a prototype implementation with the help of Mark 
Lord's sata_mv target mode support...

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-31 21:51                                             ` Jeremy Fitzhardinge
@ 2009-03-31 22:30                                               ` Bojan Smojver
  2009-04-01  5:26                                                 ` Bojan Smojver
  0 siblings, 1 reply; 664+ messages in thread
From: Bojan Smojver @ 2009-03-31 22:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: linux-kernel

On Tue, 2009-03-31 at 14:51 -0700, Jeremy Fitzhardinge wrote:
> I was thinking something like "munge_important_stuff | fsync > output"
> - ie, cat which fsyncs on close.

Yeah, after I wrote my initial comment, I noticed you were saying
essentially the same thing in your original post. I know, I should
_read_ before posting. Sorry :-(

> In fact, it's vaguely surprising that GNU cat doesn't have this
> already.

I have no idea why we don't have that either. FreeBSD code seems really
straightforward.

-- 
Bojan


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 6/7] xfs: propagate issue-flush error code
  2009-03-30 12:33                                                   ` [PATCH 6/7] xfs: propagate issue-flush error code Fernando Luis Vázquez Cao
  2009-03-30 15:20                                                     ` Bartlomiej Zolnierkiewicz
@ 2009-03-31 23:37                                                     ` Dave Chinner
  2009-04-01  3:52                                                       ` Fernando Luis Vázquez Cao
  1 sibling, 1 reply; 664+ messages in thread
From: Dave Chinner @ 2009-03-31 23:37 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jeff Garzik, Christoph Hellwig, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List, chris.mason, tj, bzolnier

On Mon, Mar 30, 2009 at 09:33:14PM +0900, Fernando Luis Vázquez Cao wrote:
> blkdev_issue_flush() may fail (e.g. due to a media error on FLUSH CACHE
> command execution) so its users should check the return value.
>
> (This issue was first spotted by Bartlomiej Zolnierkiewicz)
>
> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>

I think this patch is unnecessary as well as being broken.


> diff -urNp linux-2.6.29-orig/fs/xfs/xfs_vnodeops.c linux-2.6.29/fs/xfs/xfs_vnodeops.c
> --- linux-2.6.29-orig/fs/xfs/xfs_vnodeops.c	2009-03-24 08:12:14.000000000 +0900
> +++ linux-2.6.29/fs/xfs/xfs_vnodeops.c	2009-03-30 15:08:21.000000000 +0900
> @@ -678,20 +678,20 @@ xfs_fsync(
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	}
>
> -	if ((ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
> +	if (!error && (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {

That is wrong. Even if there was an error, we still need to
flush the device if it hasn't already been done.

>  		/*
>  		 * If the log write didn't issue an ordered tag we need
>  		 * to flush the disk cache for the data device now.
>  		 */
>  		if (!log_flushed)
> -			xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);
> +			error = xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);

What happens if we get an EOPNOTSUPP here?
That is a meaningless error to return to fsync()....


>  		/*
>  		 * If this inode is on the RT dev we need to flush that
>  		 * cache as well.
>  		 */
> -		if (XFS_IS_REALTIME_INODE(ip))
> -			xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);
> +		if (!error && XFS_IS_REALTIME_INODE(ip))
> +			error = xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);

That is broken, too. The realtime device is a different device,
so it should always be flushed regardless of the return from the
log device.
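
To make the three objections concrete, here is a stand-alone sketch of the
error handling I would expect. It is not XFS code: struct flush_target,
flush_all() and the flush_* stubs are hypothetical, and treating
-EOPNOTSUPP as "nothing to flush" is an assumption about the right policy
rather than settled behaviour:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-device cache-flush call. */
typedef int (*flush_fn)(void *dev);

struct flush_target {
	flush_fn flush;
	void *dev;
};

/*
 * Flush every target even if an earlier step or flush failed.
 * -EOPNOTSUPP is swallowed; the first real error is what the caller
 * (fsync in this discussion) gets back.
 */
int flush_all(const struct flush_target *t, size_t n, int prior_error)
{
	int error = prior_error;

	for (size_t i = 0; i < n; i++) {
		int ret = t[i].flush(t[i].dev);

		if (ret == -EOPNOTSUPP)
			ret = 0;	/* meaningless to report to fsync() */
		if (ret && !error)
			error = ret;	/* keep the first failure */
	}
	return error;
}

/* Demo stubs: count calls through dev and return a fixed status. */
int flush_ok(void *dev)     { ++*(int *)dev; return 0; }
int flush_eio(void *dev)    { ++*(int *)dev; return -EIO; }
int flush_unsupp(void *dev) { ++*(int *)dev; return -EOPNOTSUPP; }
```

With the data device and the realtime device as two targets, a failure on
the first no longer prevents the second from being flushed.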

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 12:55                                                                               ` Chris Mason
  2009-03-30 17:42                                                                                 ` Theodore Tso
@ 2009-03-31 23:55                                                                                 ` Dave Chinner
  2009-04-01 12:53                                                                                   ` Chris Mason
  1 sibling, 1 reply; 664+ messages in thread
From: Dave Chinner @ 2009-03-31 23:55 UTC (permalink / raw)
  To: Chris Mason
  Cc: Mark Lord, Stefan Richter, Jeff Garzik, Linus Torvalds,
	Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Mon, Mar 30, 2009 at 08:55:51AM -0400, Chris Mason wrote:
> On Mon, 2009-03-30 at 10:14 +1100, Dave Chinner wrote:
> > On Sat, Mar 28, 2009 at 11:17:08AM -0400, Mark Lord wrote:
> > > The better solution seems to be the rather obvious one:
> > >
> > >   the filesystem should commit data to disk before altering metadata.
> > 
> > Generalities are bad. For example:
> > 
> > write();
> > unlink();
> > <do more stuff>
> > close();
> > 
> > This is a clear case where you want metadata changed before data is
> > committed to disk. In many cases, you don't even want the data to
> > hit the disk here.
> > 
> > Similarly, rsync does the magic open,write,close,rename sequence
> > without an fsync before the rename. And it doesn't need the fsync,
> > either. The proposed implicit fsync on rename will kill rsync
> > performance, and I think that may make many people unhappy....
> > 
> 
> Sorry, I'm afraid that rsync falls into the same category as the
> kde/gnome apps here.

I disagree.

> There are a lot of backup programs built around rsync, and every one of
> them risks losing the old copy of the file by renaming an unflushed new
> copy over it.

If you crash while rsync is running, then the state of the copy
is garbage anyway. You have to restart from scratch and rsync will
detect such failures and resync the file. gnome/kde have no
mechanism for such recovery.

> rsync needs the flushing about a million times more than gnome and kde,
> and it doesn't have any option to do it automatically.

And therein lies the problem with a "flush-before-rename"
semantic....
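
For reference, the sequence being argued about looks like this from the
application side. This is a sketch, not rsync code: replace_file() and its
do_fsync switch are made up for illustration, and the temporary name is
supplied by the caller. With do_fsync == 0 you get the sequence described
above for rsync; with do_fsync == 1 you get the variant that survives a
crash without losing the old copy:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes of `data` via the temporary file `tmp`.
 * The fsync() before rename() is the step under debate.
 * Returns 0 on success, -1 on error (the temporary file is removed).
 */
int replace_file(const char *path, const char *tmp,
		 const void *data, size_t len, int do_fsync)
{
	const char *p = data;
	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	while (len > 0) {
		ssize_t w = write(fd, p, len);

		if (w < 0)
			goto fail;
		p += w;
		len -= (size_t)w;
	}
	/* Force the new data to disk before it can replace the old file. */
	if (do_fsync && fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp);
		return -1;
	}
	if (rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
fail:
	close(fd);
	unlink(tmp);
	return -1;
}
```

An implicit flush on rename would effectively force do_fsync == 1 on every
caller, including the ones that can recover on their own.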

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: range-based cache flushing (was Re: Linux 2.6.29)
  2009-03-30 19:05                                     ` range-based cache flushing (was Re: Linux 2.6.29) Jeff Garzik
@ 2009-04-01  0:14                                       ` James Bottomley
  2009-04-01  1:28                                         ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: James Bottomley @ 2009-04-01  0:14 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Mon, 2009-03-30 at 15:05 -0400, Jeff Garzik wrote:
> James Bottomley wrote:
> > On Wed, 2009-03-25 at 16:25 -0400, Ric Wheeler wrote:
> >> Jeff Garzik wrote:
> >>> Ric Wheeler wrote:
> >>>> And, as I am sure that you do know, to add insult to injury,
> >>>> FLUSH_CACHE is per device (not file system).
> 
> >>>> When you issue an fsync() on a disk with multiple partitions, you 
> >>>> will flush the data for all of its partitions from the write cache....
> >>> SCSI's SYNCHRONIZE CACHE command already accepts an (LBA, length) 
> >>> pair.  We could make use of that.
> 
> >>> And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
> >>> demonstrate clear benefit.
> 
> >> How well supported is this in SCSI?  Can we try it out with a commodity 
> >> SAS drive?
> 
> > What do you mean by well supported?  The way the SCSI standard is
> > written, a device can do a complete cache flush when a range flush is
> > requested and still be fully standards compliant.  There's no easy way
> > to tell if it does a complete cache flush every time other than by
> > taking the firmware apart (or asking the manufacturer).
> 
> Quite true, though wondering aloud...
> 
> How difficult would it be to pass the "lower-bound" LBA to SYNCHRONIZE 
> CACHE, where "lower bound" is defined as the lowest sector in the range 
> of sectors to be flushed?

Actually, the implementation is designed to allow this.  The standard
says if the number of blocks is zero that means flush from the specified
LBA to the end of the device.  The sync cache we currently use has LBA 0
and number of blocks zero (which means flush everything).
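
As a rough illustration of what would change, here is a sketch of filling
in the 10-byte SYNCHRONIZE CACHE CDB with a starting LBA. The helper name
build_sync_cache10() is made up and this is not the sd driver's code; the
field offsets follow the 10-byte command layout (opcode 0x35, LBA in bytes
2 to 5, number of blocks in bytes 7 and 8):

```c
#include <stdint.h>
#include <string.h>

#define SYNCHRONIZE_CACHE_10 0x35

/*
 * Fill a 10-byte SYNCHRONIZE CACHE CDB. nblocks == 0 asks the device to
 * flush from lba to the end of the medium; lba == 0 with nblocks == 0 is
 * the "flush everything" form.
 */
void build_sync_cache10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
	memset(cdb, 0, 10);
	cdb[0] = SYNCHRONIZE_CACHE_10;
	cdb[2] = (uint8_t)(lba >> 24);		/* LBA, big-endian */
	cdb[3] = (uint8_t)(lba >> 16);
	cdb[4] = (uint8_t)(lba >> 8);
	cdb[5] = (uint8_t)lba;
	cdb[7] = (uint8_t)(nblocks >> 8);	/* number of blocks, big-endian */
	cdb[8] = (uint8_t)nblocks;
}
```

The journal-at-end-of-disk case would pass the journal's first LBA and
nblocks = 0. Note the LBA field here is only 32 bits wide, so large
devices would need the 16-byte form of the command.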

> That seems like a reasonable optimization -- it gives the drive an easy 
> way to skip sync'ing sectors lower than the lower-bound LBA, if it is 
> capable.  Otherwise, a standards-compliant firmware will behave as you 
> describe, and do what our code currently expects today -- a full cache 
> flush.
> 
> This seems like a good way to speed up cache flush [on SCSI], while also 
> perhaps experimenting with a more fine-grained way to pass down write 
> barriers to the device.
> 
> Not a high priority thing overall, but OTOH, consider the case of 
> placing your journal at the end of the disk.  You could then issue a 
> cache flush with a non-zero starting offset:
> 
> 	SYNCHRONIZE CACHE (max sectors - JOURNAL_SIZE, ~0)
> 
> That should be trivial even for dumb disk firmwares to optimize.

We could try it ... I'm still not sure how we'd tell the device is
actually implementing it and not flushing the entire device.

James



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 17:03                                                                                 ` Jens Axboe
@ 2009-04-01  0:43                                                                                   ` Tejun Heo
  0 siblings, 0 replies; 664+ messages in thread
From: Tejun Heo @ 2009-04-01  0:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Ric Wheeler, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david

Jens Axboe wrote:
> On Tue, Mar 31 2009, Jens Axboe wrote:
>> On Tue, Mar 31 2009, Linus Torvalds wrote:
>>>
>>> On Tue, 31 Mar 2009, Ric Wheeler wrote:
>>>> The question is really what we do when you have a storage device in your box
>>>> with a volatile write cache that does support flush or fua or similar.
>>> Ok. Then you are talking about a different case - not EOPNOTSUPP.
>> So here's a test patch that attempts to just ignore such a failure to
>> flush the caches. It will still flag the bio as BIO_EOPNOTSUPP, but
>> that's merely maintaining the information in case the caller does want
>> to see if that barrier failed or not. It may not actually be useful, in
>> which case we can just kill that flag.
> 
> Updated version, the previous missed most of the buffer_eopnotsupp()
> checking. So this one also gets rid of the file system retry logic.
> Thanks to gfs2 Steve for pointing out that I missed gfs2, made me
> realize that I missed a lot more as well.

Wouldn't it be cleaner to simply finish with success status from
blk_do_ordered()?  That is the single place that all flush/barrier ops
go through and semantically better place too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 1/7] block: Add block_flush_device()
  2009-03-31 17:19                                                                                   ` Jens Axboe
@ 2009-04-01  0:54                                                                                     ` Tejun Heo
  0 siblings, 0 replies; 664+ messages in thread
From: Tejun Heo @ 2009-04-01  0:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Ric Wheeler, Fernando Luis Vázquez Cao,
	Jeff Garzik, Christoph Hellwig, Theodore Tso, Ingo Molnar,
	Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	chris.mason, david

Hello,

Jens Axboe wrote:
>> For example, if the drive doesn't support FUA, then you cannot do a 
>> serialized IO operation, but you can still mostly do a serialized op 
>> without any IO attached to it.
> 
> FUA we should be able to reliably detect, it's really the cache flush
> operation itself that has caused headaches in the past. The -EOPNOTSUPP
> really comes from the block layer, not from the device driver. That's
> mainly due to the fact that we only send down the actual barrier, if the
> driver already said it supported them. If they do fail them, we probably
> need to pick up the -EIO bits and pieces and pretend it didn't happen as
> well. So it definitely needs more looking into, auditing, and testing.
> I'll do that tomorrow.

Yeah, we need to implement some kind of fallback logic such that
filesystems get errors iff the underlying device actually failed to
flush.  For the most part, this shouldn't be too difficult.

There is a corner case for tag ordered requests in that retrying
might end up putting the barrier on the platter after writes following
it.  Well, the problem isn't specific to fallback tho.  The root
problem is that later commands get issued before the previous ones are
finished and SCSI ordered tag doesn't mandate failure of earlier
request to abort all the following ones, so by the time block layer
knows about the failure, writes after the barrier might already be on
the platter.  I guess we'll have to ignore that for the time being.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync()
  2009-03-31 21:29                                                               ` Jeff Garzik
@ 2009-04-01  1:03                                                                 ` Tejun Heo
  0 siblings, 0 replies; 664+ messages in thread
From: Tejun Heo @ 2009-04-01  1:03 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jens Axboe, Theodore Tso, Chris Mason,
	Fernando Luis Vázquez Cao, Christoph Hellwig,
	Linus Torvalds, Ingo Molnar, Alan Cox, Arjan van de Ven,
	Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees,
	Jesper Krogh, Linux Kernel Mailing List, david

Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Tue, Mar 31 2009, Tejun Heo wrote:
>>> Hello,
>>>
>>> Theodore Tso wrote:
>>>> On Mon, Mar 30, 2009 at 10:15:51AM -0400, Chris Mason wrote:
>>>>> I'm not sure we want to stick Fernando with changing how barriers are
>>>>> done in individual filesystems, his patch is just changing the
>>>>> existing
>>>>> call points.
>>>> Well, his patch actually added some calls to block_issue_flush().  But
>>>> yes, it's probably better if he just changes the existing call points,
>>>> and we can have the relevant filesystem maintainers double check to
>>>> make sure that there aren't any new call points which are needed.
>>> How about having something like blk_ensure_cache_flushed() which
>>> issues flush iff there hasn't been any write since the last flush?
>>> It'll be easy to implement and will filter out duplicate flushes in
>>> most cases.
>>
>> My original ide implementation of flushes actually did this. My memory
>> is a little hazy on why it was dropped, I'm guessing because it
>> basically never triggered anyway.
> 
> Yeah, and it probably wouldn't trigger today unless we add new code that
> starts generating enough duplicate cache flushes for this to be
> significant...

Well, the thread was about adding such a call, so...

> And since duplicate cache flushes are harmless to the drive, you're only
> talking about no-op ATA command overhead.  Which is only mildly notable
> on legacy IDE (eight or so inb/outb operations).
> 
> I would put duplicate cache flush filtering way, way down on the
> priority list, IMO.

Yeap, unless FS guys need it, there's no reason to push it.
Although having dup flush detection Theodore described (w/ callstack
saving at issue time) would be nice for debugging.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-30 19:02                                                                                   ` Bill Davidsen
@ 2009-04-01  1:19                                                                                     ` david
  2009-04-01 16:24                                                                                       ` Bill Davidsen
  0 siblings, 1 reply; 664+ messages in thread
From: david @ 2009-04-01  1:19 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

On Mon, 30 Mar 2009, Bill Davidsen wrote:

> Andreas T.Auer wrote:
>> On 30.03.2009 02:39 Theodore Tso wrote:
>>> All I can do is apologize to all other filesystem developers profusely
>>> for ext3's data=ordered semantics; at this point, I very much regret
>>> that we made data=ordered the default for ext3.  But the application
>>> writers vastly outnumber us, and realistically we're not going to be
>>> able to easily roll back eight years of application writers being
>>> trained that fsync() is not necessary, and actually is detrimental for
>>> ext3.
>
>> And still I don't know any reason, why it makes sense to write the
>> metadata to non-existing data immediately instead of delaying that, too.
>> 
> Here I have the same question, I don't expect or demand that anything be done 
> in a particular order unless I force it so, and I expect there to be some 
> corner case where the data is written and the metadata doesn't reflect that 
> in the event of a failure, but I can't see that it ever a good idea to have 
> the metadata reflect the future and describe what things will look like if 
> everything goes as planned. I have had enough of that BS from financial 
> planners and politicians, metadata shouldn't try to predict the future just 
> to save a ms here or there. It's also necessary to have the metadata match 
> reality after fsync(), of course, or even the well behaved applications 
> mentioned in this thread haven't a hope of staying consistent.
>
> Feel free to clarify why clairvoyant metadata is ever a good thing...

it's not that it's deliberately pushing metadata out ahead of file data, 
but say you have the following sequence

write to file1
update metadata for file1
write to file2
update metadata for file2

if file1 and file2 are in the same directory your software can finish all 
four of these steps before _any_ of the data gets pushed to disk.

then when the system goes to write the metadata for file1 it is pushing 
the then-current copy of that sector to disk, which includes the metadata 
for file2, even though the data for file2 hasn't been written yet.
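
The sequence above can be sketched as a toy model in C. Everything here 
(struct dir_model, write_file(), flush_meta_block() and so on) is a 
hypothetical name made up for illustration, not filesystem code; the only 
assumption taken from the mail is that the metadata for both files lives 
in one sector that is written out as a unit.

```c
#include <stdbool.h>

#define NFILES 2

/* One on-disk sector holding the metadata (here: just a size) of both files. */
struct meta_block {
	unsigned size[NFILES];
};

/* Hypothetical model of a directory whose files share a metadata sector. */
struct dir_model {
	struct meta_block mem;		/* in-memory, possibly dirty copy */
	struct meta_block disk;		/* what has reached the disk */
	bool data_on_disk[NFILES];	/* has the file data been written? */
};

/* Steps 1+2 (or 3+4): write to a file and update its in-memory metadata. */
static void write_file(struct dir_model *m, int f, unsigned size)
{
	m->mem.size[f] = size;
	m->data_on_disk[f] = false;
}

/* Push the data blocks of one file to disk. */
static void flush_data(struct dir_model *m, int f)
{
	m->data_on_disk[f] = true;
}

/* Push the metadata sector: the whole then-current sector goes out. */
static void flush_meta_block(struct dir_model *m)
{
	m->disk = m->mem;
}

/* True if the on-disk metadata describes data that is not on disk yet. */
static bool meta_ahead_of_data(const struct dir_model *m, int f)
{
	return m->disk.size[f] != 0 && !m->data_on_disk[f];
}
```

Writing to both files, flushing only the data of file1 and then flushing 
the shared sector leaves file2 with on-disk metadata that is ahead of its 
data, even though nothing set out to write file2's metadata early.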

if you try to say 'flush all data blocks before metadata blocks' and have 
a lot of activity going on in a directory, and have to wait until it all 
stops before you write any of the metadata out, you could be blocked from 
writing the metadata for a _long_ time.

Also, if someone does an fsync on any of those files you can end up waiting 
a long time for all that other data to get written out (especially if the 
files are still being modified while you are trying to do the fsync). As I 
understand it, this is the fundamental cause of the slow fsync calls on 
ext3 with data=ordered.

David Lang

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: range-based cache flushing (was Re: Linux 2.6.29)
  2009-04-01  0:14                                       ` James Bottomley
@ 2009-04-01  1:28                                         ` Jeff Garzik
  2009-04-01 21:20                                           ` James Bottomley
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-01  1:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

James Bottomley wrote:
> On Mon, 2009-03-30 at 15:05 -0400, Jeff Garzik wrote:
>> James Bottomley wrote:
>>> On Wed, 2009-03-25 at 16:25 -0400, Ric Wheeler wrote:
>>>> Jeff Garzik wrote:
>>>>> Ric Wheeler wrote:
>>>>>> And, as I am sure that you do know, to add insult to injury,
>>>>>> FLUSH_CACHE is per device (not file system).
>>>>>> When you issue an fsync() on a disk with multiple partitions, you 
>>>>>> will flush the data for all of its partitions from the write cache....
>>>>> SCSI'S SYNCHRONIZE CACHE command already accepts an (LBA, length) 
>>>>> pair.  We could make use of that.
>>>>> And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
>>>>> demonstrate clear benefit.
>>>> How well supported is this in SCSI?  Can we try it out with a commodity 
>>>> SAS drive?
>>> What do you mean by well supported?  The way the SCSI standard is
>>> written, a device can do a complete cache flush when a range flush is
>>> requested and still be fully standards compliant.  There's no easy way
>>> to tell if it does a complete cache flush every time other than by
>>> taking the firmware apart (or asking the manufacturer).
>> Quite true, though wondering aloud...
>>
>> How difficult would it be to pass the "lower-bound" LBA to SYNCHRONIZE 
>> CACHE, where "lower bound" is defined as the lowest sector in the range 
>> of sectors to be flushed?
> 
> Actually, the implementation is designed to allow this.  The standard
> says if the number of blocks is zero that means flush from the specified
> LBA to the end of the device.  The sync cache we currently use has LBA 0
> and number of blocks zero (which means flush everything).

Yeah, that feature of the spec was what got me thinking.

"difficult" was referring more to the kernel side of things...  if 
calculating the lowest LBA of a write barrier is difficult and/or 
CPU-consuming, the effort may not be worth it.

But if we could stick a

	if (LBA < barrier-lower-bound)
		barrier-lower-bound = LBA

somewhere, then pass that to SYNCHRONIZE CACHE, it could be a cheap way 
to increase sync-cache speed.

It seems extremely unlikely that sync-cache speed would _decrease_:  for 
flush-everything firmwares, the sync-cache speed would remain unchanged.
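
A minimal user-space sketch of that idea, assuming the lower bound is 
tracked per barrier and then placed in the LBA field of a SYNCHRONIZE 
CACHE(10) command with a block count of zero ("to the end of the 
device"). The struct and function names are hypothetical and this is not 
kernel code; only the CDB layout (opcode 0x35, LBA in bytes 2-5, number 
of blocks in bytes 7-8) comes from the SCSI command set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-barrier state: lowest LBA written since the last flush. */
struct flush_window {
	uint64_t lower_bound;	/* UINT64_MAX means "nothing written yet" */
};

static void flush_window_reset(struct flush_window *w)
{
	w->lower_bound = UINT64_MAX;
}

/* Called once per write; this is the single compare suggested above. */
static void flush_window_note_write(struct flush_window *w, uint64_t lba)
{
	if (lba < w->lower_bound)
		w->lower_bound = lba;
}

/*
 * Fill a SYNCHRONIZE CACHE(10) CDB.  The LBA field is only 32 bits wide,
 * so fall back to LBA 0 (flush everything) when nothing was tracked or
 * the lower bound does not fit.
 */
static void build_sync_cache10(uint8_t cdb[10], const struct flush_window *w)
{
	uint32_t lba = 0;

	assert(cdb != NULL && w != NULL);
	if (w->lower_bound <= UINT32_MAX)
		lba = (uint32_t)w->lower_bound;

	memset(cdb, 0, 10);
	cdb[0] = 0x35;			/* SYNCHRONIZE CACHE(10) opcode */
	cdb[2] = (uint8_t)(lba >> 24);	/* LBA, big endian */
	cdb[3] = (uint8_t)(lba >> 16);
	cdb[4] = (uint8_t)(lba >> 8);
	cdb[5] = (uint8_t)lba;
	/* bytes 7-8 stay 0: flush from LBA to the end of the device */
}
```

The fallback to LBA 0 keeps today's flush-everything behaviour whenever 
the tracked value is unusable, so the sketch can only narrow the range, 
never widen it.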


>> That seems like a reasonable optimization -- it gives the drive an easy 
>> way to skip sync'ing sectors lower than the lower-bound LBA, if it is 
>> capable.  Otherwise, a standards-compliant firmware will behave as you 
>> describe, and do what our code currently expects today -- a full cache 
>> flush.
>>
>> This seems like a good way to speed up cache flush [on SCSI], while also 
>> perhaps experimenting with a more fine-grained way to pass down write 
>> barriers to the device.
>>
>> Not a high priority thing overall, but OTOH, consider the case of 
>> placing your journal at the end of the disk.  You could then issue a 
>> cache flush with a non-zero starting offset:
>>
>> 	SYNCHRONIZE CACHE (max sectors - JOURNAL_SIZE, ~0)
>>
>> That should be trivial even for dumb disk firmwares to optimize.
> 
> We could try it ... I'm still not sure how we'd tell the device is
> actually implementing it and not flushing the entire device.

Is that knowledge necessary?

Assuming the lower-bound is super-cheap to calculate, then the two most 
likely outcomes are:  sync-cache speed remains the same, or sync-cache 
speed increases.

If the calculation of lower-bound is costly, I could see the need for 
that knowledge -- but if the cost is too high, the entire effort is 
likely to be scuttled, rather than worrying about detecting 
flush-everything firmwares.

	Jeff





^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: [PATCH 6/7] xfs: propagate issue-flush error code
  2009-03-31 23:37                                                     ` Dave Chinner
@ 2009-04-01  3:52                                                       ` Fernando Luis Vázquez Cao
  0 siblings, 0 replies; 664+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-04-01  3:52 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
	Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	David Rees, Jesper Krogh, Linux Kernel Mailing List, chris.mason,
	tj, bzolnier

Dave Chinner wrote:
> On Mon, Mar 30, 2009 at 09:33:14PM +0900, Fernando Luis Vázquez Cao wrote:
>> blkdev_issue_flush() may fail (e.g. due to a media error on FLUSH CACHE
>> command execution) so its users should check for the return value.
>>
>> (This issue was first spotted by Bartlomiej Zolnierkiewicz)
>>
>> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> 
> I think this patch is unnecessary as well as being broken.
> 
> 
>> diff -urNp linux-2.6.29-orig/fs/xfs/xfs_vnodeops.c linux-2.6.29/fs/xfs/xfs_vnodeops.c
>> --- linux-2.6.29-orig/fs/xfs/xfs_vnodeops.c	2009-03-24 08:12:14.000000000 +0900
>> +++ linux-2.6.29/fs/xfs/xfs_vnodeops.c	2009-03-30 15:08:21.000000000 +0900
>> @@ -678,20 +678,20 @@ xfs_fsync(
>>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>>  	}
>>
>> -	if ((ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
>> +	if (!error && (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
> 
> That is wrong. Even if there was an error, we still need to
> flush the device if it hasn't already been done.

If any of the previous writes failed there is no way to know what we are actually
flushing. When we know things went awry I do not see the point in flushing the
device since part of the data we were trying to sync might not have made it to
the device.

Anyway this is a minor nitpick/policy issue that can be easily reverted to keep
the previous behavior.

>>  		/*
>>  		 * If the log write didn't issue an ordered tag we need
>>  		 * to flush the disk cache for the data device now.
>>  		 */
>>  		if (!log_flushed)
>> -			xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);
>> +			error = xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);
> 
> What happens if we get an EOPNOTSUPP here?
> That is a meaningless error to return to fsync()....

Please look at the code again. xfs_blkdev_issue_flush() calls blkdev_issue_flush()
which turns EOPNOTSUPP into 0 to hide that error from filesystems. It is the
non-EOPNOTSUPP errors that XFS should handle: the underlying device may support
write cache flushes and still fail to flush (due to hardware errors)!
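
The error handling described here can be modelled in a few lines of 
user-space C. This is a sketch under assumptions, not the XFS or block 
layer code: struct fake_dev, issue_flush() and flush_data_and_rt() are 
hypothetical stand-ins. The last function also shows one possible way to 
flush both devices unconditionally while still reporting the first real 
error, which is neither what the patch does nor something proposed in 
this thread.

```c
#include <errno.h>

/* Hypothetical device model: flush() returns 0 or a negative errno. */
struct fake_dev {
	int (*flush)(struct fake_dev *dev);
};

static int flush_ok(struct fake_dev *dev)
{
	(void)dev;
	return 0;
}

static int flush_unsupported(struct fake_dev *dev)
{
	(void)dev;
	return -EOPNOTSUPP;	/* device has no cache flush command */
}

static int flush_media_error(struct fake_dev *dev)
{
	(void)dev;
	return -EIO;		/* flush is supported but failed */
}

/* Stand-in for the helper: "not supported" is hidden, real errors pass. */
static int issue_flush(struct fake_dev *dev)
{
	int err = dev->flush(dev);

	return err == -EOPNOTSUPP ? 0 : err;
}

/* Flush both devices regardless, then report the first real error. */
static int flush_data_and_rt(struct fake_dev *data, struct fake_dev *rt)
{
	int err = issue_flush(data);
	int rt_err = rt ? issue_flush(rt) : 0;

	return err ? err : rt_err;
}
```

With this shape fsync() never sees EOPNOTSUPP, but a media error on 
either device still reaches the caller.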

This patch is an attempt to fix the current situation.

>>  		/*
>>  		 * If this inode is on the RT dev we need to flush that
>>  		 * cache as well.
>>  		 */
>> -		if (XFS_IS_REALTIME_INODE(ip))
>> -			xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);
>> +		if (!error && XFS_IS_REALTIME_INODE(ip))
>> +			error = xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);
> 
> That is broken, too. The realtime device is a different device,
> so always should be flushed regardless of the return from the
> log device.

Does it still make sense when writes to the log have failed?

Thanks!

- Fernando

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-31 22:30                                               ` Bojan Smojver
@ 2009-04-01  5:26                                                 ` Bojan Smojver
  2009-04-01  6:35                                                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 664+ messages in thread
From: Bojan Smojver @ 2009-04-01  5:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: linux-kernel

On Wed, 2009-04-01 at 09:30 +1100, Bojan Smojver wrote:
> I have no idea why we don't have that either. FreeBSD code seems
> really straightforward.

I just tried using dd with conv=fsync option and that kinda does what
you mentioned. I see this at the end of strace:
---------------------------------
write(1, "<some data...>"..., 512) = 512
read(0, ""..., 512)                = 0
fsync(1)                           = 0
close(0)                           = 0
close(1)                           = 0
---------------------------------

So, maybe GNU folks just don't want to have yet another tool for this.
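
For reference, the syscall pattern in that strace (copy in 512-byte 
blocks, then fsync() the output before close()) looks roughly like this 
in C. This is only a sketch of the pattern, not dd's source; 
copy_to_file_synced() and DD_BLOCK_SIZE are made-up names.

```c
#include <fcntl.h>
#include <unistd.h>

#define DD_BLOCK_SIZE 512	/* block size seen in the strace above */

/* Copy everything from in_fd to out_path, fsync(), close().  0 or -1. */
static int copy_to_file_synced(int in_fd, const char *out_path)
{
	char buf[DD_BLOCK_SIZE];
	ssize_t n;
	int ret = -1;
	int out_fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (out_fd < 0)
		return -1;

	while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
		ssize_t off = 0;

		/* write() may be short; keep going until the block is out */
		while (off < n) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

			if (w <= 0)
				goto out;
			off += w;
		}
	}
	if (n < 0)
		goto out;

	/* read() returned 0 (end of input): flush before closing */
	if (fsync(out_fd) == 0)
		ret = 0;
out:
	if (close(out_fd) < 0)
		ret = -1;
	return ret;
}
```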

-- 
Bojan


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01  5:26                                                 ` Bojan Smojver
@ 2009-04-01  6:35                                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 664+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-01  6:35 UTC (permalink / raw)
  To: Bojan Smojver; +Cc: linux-kernel

Bojan Smojver wrote:
> On Wed, 2009-04-01 at 09:30 +1100, Bojan Smojver wrote:
>   
>> I have no idea why we don't have that either. FreeBSD code seems
>> really straightforward.
>>     
>
> I just tried using dd with conv=fsync option and that kinda does what
> you mentioned. I see this at the end of strace:
> ---------------------------------
> write(1, "<some data...>"..., 512) = 512
> read(0, ""..., 512)                = 0
> fsync(1)                           = 0
> close(0)                           = 0
> close(1)                           = 0
> ---------------------------------
>
> So, maybe GNU folks just don't want to have yet another tool for this.
>   

Huh, didn't know dd had grown that.  Confusingly similar to the 
completely different conv=sync, so it's a perfect dd addition.  Ooh, 
fdatasync too.

    J

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-31 23:55                                                                                 ` Dave Chinner
@ 2009-04-01 12:53                                                                                   ` Chris Mason
  2009-04-01 15:41                                                                                     ` Andreas T.Auer
  0 siblings, 1 reply; 664+ messages in thread
From: Chris Mason @ 2009-04-01 12:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mark Lord, Stefan Richter, Jeff Garzik, Linus Torvalds,
	Matthew Garrett, Alan Cox, Theodore Tso, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, 2009-04-01 at 10:55 +1100, Dave Chinner wrote:
> On Mon, Mar 30, 2009 at 08:55:51AM -0400, Chris Mason wrote:
> > On Mon, 2009-03-30 at 10:14 +1100, Dave Chinner wrote:
> > > On Sat, Mar 28, 2009 at 11:17:08AM -0400, Mark Lord wrote:
> > > > The better solution seems to be the rather obvious one:
> > > >
> > > >   the filesystem should commit data to disk before altering metadata.
> > > 
> > > Generalities are bad. For example:
> > > 
> > > write();
> > > unlink();
> > > <do more stuff>
> > > close();
> > > 
> > > This is a clear case where you want metadata changed before data is
> > > committed to disk. In many cases, you don't even want the data to
> > > hit the disk here.
> > > 
> > > Similarly, rsync does the magic open,write,close,rename sequence
> > > without an fsync before the rename. And it doesn't need the fsync,
> > > either. The proposed implicit fsync on rename will kill rsync
> > > performance, and I think that may make many people unhappy....
> > > 
> > 
> > Sorry, I'm afraid that rsync falls into the same category as the
> > kde/gnome apps here.
> 
> I disagree.
>
> > There are a lot of backup programs built around rsync, and every one of
> > them risks losing the old copy of the file by renaming an unflushed new
> > copy over it.
> 
> If you crash while rsync is running, then the state of the copy
> is garbage anyway. You have to restart from scratch and rsync will
> detect such failures and resync the file. gnome/kde have no
> mechanism for such recovery.
> 

If this were the recovery system they had in mind, then why use rename
at all?  They could just as easily overwrite the original in place.
Using rename implies they want to replace the old with a complete new
version.

There's also the window where you crash after the rsync is done but
before all the new data safely makes it into the replacement files.

> > rsync needs the flushing about a million times more than gnome and kde,
> > and it doesn't have any option to do it automatically.
> 
> And therein lies the problem with a "flush-before-rename"
> semantic....

Here I was just talking about an rsync --flush-after-rename or something,
not an option from the kernel.

-chris



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 12:53                                                                                   ` Chris Mason
@ 2009-04-01 15:41                                                                                     ` Andreas T.Auer
  2009-04-01 16:02                                                                                       ` Chris Mason
  0 siblings, 1 reply; 664+ messages in thread
From: Andreas T.Auer @ 2009-04-01 15:41 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dave Chinner, Mark Lord, Stefan Richter, Jeff Garzik,
	Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On 01.04.2009 14:53 Chris Mason wrote:
> On Wed, 2009-04-01 at 10:55 +1100, Dave Chinner wrote:
>   
>> If you crash while rsync is running, then the state of the copy
>> is garbage anyway. You have to restart from scratch and rsync will
>> detect such failures and resync the file. gnome/kde have no
>> mechanism for such recovery.
>>     
> If this were the recovery system they had in mind, then why use rename
> at all?  They could just as easily overwrite the original in place.
>   

It is not a recovery system.  The renaming procedure is almost atomic
with e.g. reiser or ext3 (ordered), but simple overwriting would always
leave a window between truncating and the complete rewrite of the file.

> Using rename implies they want to replace the old with a complete new
> version.
>
> There's also the window where you crash after the rsync is done but
> before all the new data safely makes it into the replacement files.
>   

Sure, but in that case you have only lost some of your _mirrored_ data.
The original will usually be untouched by this. So after the restart you
just start the mirroring process again, and hopefully, this time you get
a perfect copy.

In KDE and lots of other apps the _original_ config files (and not any
copies) are "overlinked" with the new files by the rename. That's the
difference.

Andreas


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 15:41                                                                                     ` Andreas T.Auer
@ 2009-04-01 16:02                                                                                       ` Chris Mason
  2009-04-01 18:37                                                                                         ` Andreas T.Auer
  2009-04-01 21:50                                                                                         ` Theodore Tso
  0 siblings, 2 replies; 664+ messages in thread
From: Chris Mason @ 2009-04-01 16:02 UTC (permalink / raw)
  To: Andreas T.Auer
  Cc: Dave Chinner, Mark Lord, Stefan Richter, Jeff Garzik,
	Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, 2009-04-01 at 17:41 +0200, Andreas T.Auer wrote:
> 
> On 01.04.2009 14:53 Chris Mason wrote:
> > On Wed, 2009-04-01 at 10:55 +1100, Dave Chinner wrote:
> >   
> >> If you crash while rsync is running, then the state of the copy
> >> is garbage anyway. You have to restart from scratch and rsync will
> >> detect such failures and resync the file. gnome/kde have no
> >> mechanism for such recovery.
> >>     
> > If this were the recovery system they had in mind, then why use rename
> > at all?  They could just as easily overwrite the original in place.
> >   
> 
> It is not a recovery system.  The renaming procedure is almost atomic
> with e.g. reiser or ext3 (ordered), but simple overwriting would always
> leave a window between truncating and the complete rewrite of the file.
> 

Well, we're considering a future where ext3 and reiser are no longer
used, and applications are responsible for the flushing if they want
renames atomic for data as well as metadata.

In this case, rename without additional flush and truncate are the same.
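
For concreteness, this is roughly what "the application does the
flushing" looks like for the open, write, close, rename sequence.  It is
a sketch only: replace_file() and its tmp_path parameter are hypothetical
names, and it leaves out anything beyond the single fsync() on the new
file (for example syncing the containing directory).

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with new contents via a temporary file.  `tmp_path` is a
 * caller-chosen name on the same filesystem (hypothetical parameter).
 * Returns 0 on success, -1 on failure.
 */
static int replace_file(const char *path, const char *tmp_path,
			const void *buf, size_t len)
{
	const char *p = buf;
	int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n <= 0)
			goto fail;
		p += n;
		len -= (size_t)n;
	}

	/* The step under debate: push the new data out before the rename. */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp_path);
		return -1;
	}

	/* Only now make the new contents visible under the old name. */
	if (rename(tmp_path, path) < 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp_path);
	return -1;
}
```

Dropping the fsync() call gives the sequence rsync uses today, which is
where the old copy can be lost if the rename reaches the disk before the
new data does.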

> > Using rename implies they want to replace the old with a complete new
> > version.
> >
> > There's also the window where you crash after the rsync is done but
> > before all the new data safely makes it into the replacement files.
> >   
> 
> Sure, but in that case you have only lost some of your _mirrored_ data.
> The original will usually be untouched by this. So after the restart you
> just start the mirroring process again, and hopefully, this time you get
> a perfect copy.
> 

If we crash during the rsync, the backup logs will yell.  If we crash
just after the rsync, the backup logs won't know.  The data could still
be gone.

> In KDE and lots of other apps the _original_ config files (and not any
> copies) are "overlinked" with the new files by the rename. That's the
> difference.

We don't run backup programs because we can use the original as a backup
for the backup ;)  From an rsync-for-backup point of view, the backup is
the only copy.

Yes, rsync could easily be fixed.  Or maybe people just aren't worried,
it's hard to say.  Having the ext3 style flush with the rename makes the
system easier to use, and easier to predict how it will react.

rsync was originally brought up when someone asked about applications
that do renames and don't care about atomic data replacement.  If the
flushing is a horrible thing, there must be a lot more examples?

-chris



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01  1:19                                                                                     ` david
@ 2009-04-01 16:24                                                                                       ` Bill Davidsen
  2009-04-01 20:15                                                                                         ` david
  0 siblings, 1 reply; 664+ messages in thread
From: Bill Davidsen @ 2009-04-01 16:24 UTC (permalink / raw)
  To: david; +Cc: linux-kernel

david@lang.hm wrote:
> On Mon, 30 Mar 2009, Bill Davidsen wrote:
>
>> Andreas T.Auer wrote:
>>> On 30.03.2009 02:39 Theodore Tso wrote:
>>>> All I can do is apologize to all other filesystem developers profusely
>>>> for ext3's data=ordered semantics; at this point, I very much regret
>>>> that we made data=ordered the default for ext3.  But the application
>>>> writers vastly outnumber us, and realistically we're not going to be
>>>> able to easily roll back eight years of application writers being
>>>> trained that fsync() is not necessary, and actually is detrimental for
>>>> ext3.
>>
>>> And still I don't know any reason, why it makes sense to write the
>>> metadata to non-existing data immediately instead of delaying that, 
>>> too.
>>>
>> Here I have the same question, I don't expect or demand that anything 
>> be done in a particular order unless I force it so, and I expect 
>> there to be some corner case where the data is written and the 
>> metadata doesn't reflect that in the event of a failure, but I can't 
>> see that it ever a good idea to have the metadata reflect the future 
>> and describe what things will look like if everything goes as 
>> planned. I have had enough of that BS from financial planners and 
>> politicians, metadata shouldn't try to predict the future just to 
>> save a ms here or there. It's also necessary to have the metadata 
>> match reality after fsync(), of course, or even the well behaved 
>> applications mentioned in this thread haven't a hope of staying 
>> consistent.
>>
>> Feel free to clarify why clairvoyant metadata is ever a good thing...
>
> it's not that it's deliberately pushing metadata out ahead of file 
> data, but say you have the following sequence
>
> write to file1
> update metadata for file1
> write to file2
> update metadata for file2
>
Understood that it's not deliberate, just careless. The two behaviors 
which are reported are (a) updating a record in an existing file and 
having the entire file content vanish, and (b) finding someone else's 
old data in my file - a serious security issue. I haven't seen any 
report of the case where a process unlinks or truncates a file, the disk 
space gets reused, and then the system fails before the metadata is 
updated, leaving the data written by some other process in the file 
where it can be read - another possible security issue.

> if file1 and file2 are in the same directory your software can finish 
> all four of these steps before _any_ of the data gets pushed to disk.
>
> then when the system goes to write the metadata for file1 it is 
> pushing the then-current copy of that sector to disk, which includes 
> the metadata for file2, even though the data for file2 hasn't been 
> written yet.
>
> if you try to say 'flush all data blocks before metadata blocks' and 
> have a lot of activity going on in a directory, and have to wait until 
> it all stops before you write any of the metadata out, you could be 
> blocked from writing the metadata for a _long_ time.
>
If you mean "write all data for that file" before the metadata, it would 
seem to behave the way an fsync would, and the metadata should go out in 
some reasonable time.
> Also, if someone does an fsync on any of those files you can end up 
> waiting a long time for all that other data to get written out 
> (especially if the files are still being modified while you are trying 
> to do the fsync). As I understand it, this is the fundamental cause of 
> the slow fsync calls on ext3 with data=ordered.

Your analysis sounds right to me.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-31 22:22                                 ` Jeff Garzik
@ 2009-04-01 18:34                                   ` Mark Lord
  0 siblings, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-04-01 18:34 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Mark Lord, Linux Kernel Mailing List, Linux IDE mailing list

Jeff Garzik wrote:
> Ric Wheeler wrote:
>> Jeff Garzik wrote:
>>> Jens Axboe wrote:
..
>>> IMO we could look at this too, or perhaps come up with an alternate 
>>> proposal like FLUSH CACHE RANGE(s).
> 
>> I agree that it is worth getting better mechanisms in place - the 
>> cache flush is really primitive. Now we just need a victim to sit in 
>> on T13/T10 standards meetings :-)
> 
> 
> Heck, we could even do a prototype implementation with the help of Mark 
> Lord's sata_mv target mode support...
..

Speaking of which.. you probably won't see the preliminary rev
of sata_mv + target_mode until sometime this weekend.

It's going to be something quite simple for 2.6.30,
and we can expand on that in later kernels.

Cheers

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 16:02                                                                                       ` Chris Mason
@ 2009-04-01 18:37                                                                                         ` Andreas T.Auer
  2009-04-01 21:50                                                                                         ` Theodore Tso
  1 sibling, 0 replies; 664+ messages in thread
From: Andreas T.Auer @ 2009-04-01 18:37 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andreas T.Auer, Dave Chinner, Mark Lord, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On 01.04.2009 18:02 Chris Mason wrote:
> On Wed, 2009-04-01 at 17:41 +0200, Andreas T.Auer wrote:
>   
>> On 01.04.2009 14:53 Chris Mason wrote:
>>     
>> It is not a recovery system.  The renaming procedure is almost atomic
>> with e.g. reiser or ext3 (ordered), but simple overwriting would always
>> leave a window between truncating and the complete rewrite of the file.
>>
>>     
>
> Well, we're considering a future where ext3 and reiser are no longer
> used, and applications are responsible for the flushing if they want
> renames atomic for data as well as metadata.
>   
As long as you only consider it, all will be fine ;-). As a user I don't
want to use a filesystem which leaves a long gap between renaming the
metadata and writing the data for it, that is having dirty, inconsistent
metadata overwriting clean metadata. So Ted's quick pragmatic approach
to patch it in the first step was good, even if it's possible that it's
not the final solution.

Flushing in applications is not a suitable solution. Maybe barriers
could be a solution, but to get something like this into _all_ the
multitude of applications is very unlikely.
There might be filesystems which use a delayed, but ordered mode. They
could provide "atomic" renames, and perform much better, if applications
do not flush with every file update.

Andreas


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 16:24                                                                                       ` Bill Davidsen
@ 2009-04-01 20:15                                                                                         ` david
  2009-04-01 21:33                                                                                           ` Andreas T.Auer
  2009-04-01 22:00                                                                                           ` Harald Arnesen
  0 siblings, 2 replies; 664+ messages in thread
From: david @ 2009-04-01 20:15 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

On Wed, 1 Apr 2009, Bill Davidsen wrote:

> david@lang.hm wrote:
>> On Mon, 30 Mar 2009, Bill Davidsen wrote:
>> 
>>> Andreas T.Auer wrote:
>>>> On 30.03.2009 02:39 Theodore Tso wrote:
>>>>> All I can do is apologize to all other filesystem developers profusely
>>>>> for ext3's data=ordered semantics; at this point, I very much regret
>>>>> that we made data=ordered the default for ext3.  But the application
>>>>> writers vastly outnumber us, and realistically we're not going to be
>>>>> able to easily roll back eight years of application writers being
>>>>> trained that fsync() is not necessary, and actually is detrimental for
>>>>> ext3.
>>> 
>>>> And still I don't know any reason, why it makes sense to write the
>>>> metadata to non-existing data immediately instead of delaying that, too.
>>>> 
>>> Here I have the same question, I don't expect or demand that anything be 
>>> done in a particular order unless I force it so, and I expect there to be 
>>> some corner case where the data is written and the metadata doesn't 
>>> reflect that in the event of a failure, but I can't see that it ever a 
>>> good idea to have the metadata reflect the future and describe what things 
>>> will look like if everything goes as planned. I have had enough of that BS 
>>> from financial planners and politicians, metadata shouldn't try to predict 
>>> the future just to save a ms here or there. It's also necessary to have 
>>> the metadata match reality after fsync(), of course, or even the well 
>>> behaved applications mentioned in this thread haven't a hope of staying 
>>> consistent.
>>> 
>>> Feel free to clarify why clairvoyant metadata is ever a good thing...
>> 
>> it's not that it's deliberately pushing metadata out ahead of file data, but 
>> say you have the following sequence
>> 
>> write to file1
>> update metadata for file1
>> write to file2
>> update metadata for file2
>> 
> Understood that it's not deliberate, just careless. The two behaviors which 
> are reported are (a) updating a record in an existing file and having the 
> entire file content vanish, and (b) finding someone else's old data in my 
> file - a serious security issue. I haven't seen any report of the case where 
> a process unlinks or truncates a file, the disk space gets reused, and then 
> the system fails before the metadata is updated, leaving the data written by 
> some other process in the file where it can be read - another possible 
> security issue.

ext3 eliminates this security issue by writing the data before the 
metadata. ext4 (and I think XFS) eliminate this security issue by not 
allocating the blocks until it goes to write the data out. I don't know 
how other filesystems deal with this.

>> if file1 and file2 are in the same directory your software can finish all 
>> four of these steps before _any_ of the data gets pushed to disk.
>> 
>> then when the system goes to write the metadata for file1 it is pushing the 
>> then-current copy of that sector to disk, which includes the metadata for 
>> file2, even though the data for file2 hasn't been written yet.
>> 
>> if you try to say 'flush all data blocks before metadata blocks' and have a 
>> lot of activity going on in a directory, and have to wait until it all 
>> stops before you write any of the metadata out, you could be blocked from 
>> writing the metadata for a _long_ time.
>> 
> If you mean "write all data for that file" before the metadata, it would seem 
> to behave the way an fsync would, and the metadata should go out in some 
> reasonable time.

except if another file in the directory gets modified while it's writing 
out the first two, that file now would need to get written out as well, 
before the metadata for that directory can be written. if you have a busy 
system (say a database or log server), where files are getting modified 
pretty constantly, it can be a long time before all the file data is 
written out and the system is idle enough to write the metadata.

David Lang

>> Also, if someone does a fsync on any of those files you can end up waiting a 
>> long time for all that other data to get written out (especially if the 
>> files are still being modified while you are trying to do the fsync). As I 
>> understand it, this is the fundamental cause of the slow fsync calls on 
>> ext3 with data=ordered.
>
> Your analysis sounds right to me,
>
>

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27  1:25                         ` Andrew Morton
                                             ` (3 preceding siblings ...)
  2009-03-28  5:06                           ` Ingo Molnar
@ 2009-04-01 21:03                           ` Lennart Sorensen
  2009-04-01 21:36                             ` Andrew Morton
                                               ` (2 more replies)
  4 siblings, 3 replies; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-01 21:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Theodore Tso, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, Mar 26, 2009 at 06:25:19PM -0700, Andrew Morton wrote:
> The JBD journal is a massive designed-in contention point.  It's why
> for several years I've been telling anyone who will listen that we need
> a new fs.  Hopefully our response to all these problems will soon be
> "did you try btrfs?".

Oh I look forward to the day when it will be safe to convert my mythtv
box from ext3 to btrfs.  Current kernels just have too much IO latency
with ext3 it seems.  Older kernels were more responsive, but probably
had other places they were less efficient.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: range-based cache flushing (was Re: Linux 2.6.29)
  2009-04-01  1:28                                         ` Jeff Garzik
@ 2009-04-01 21:20                                           ` James Bottomley
  0 siblings, 0 replies; 664+ messages in thread
From: James Bottomley @ 2009-04-01 21:20 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso,
	Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
	Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Tue, 2009-03-31 at 21:28 -0400, Jeff Garzik wrote:
> James Bottomley wrote:
> > On Mon, 2009-03-30 at 15:05 -0400, Jeff Garzik wrote:
> >> James Bottomley wrote:
> >>> On Wed, 2009-03-25 at 16:25 -0400, Ric Wheeler wrote:
> >>>> Jeff Garzik wrote:
> >>>>> Ric Wheeler wrote:
> >>>>>> And, as I am sure that you do know, to add insult to injury,
> >>>>>> FLUSH_CACHE is per device (not file system).
> >>>>>> When you issue an fsync() on a disk with multiple partitions, you 
> >>>>>> will flush the data for all of its partitions from the write cache....
> >>>>> SCSI'S SYNCHRONIZE CACHE command already accepts an (LBA, length) 
> >>>>> pair.  We could make use of that.
> >>>>> And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could 
> >>>>> demonstrate clear benefit.
> >>>> How well supported is this in SCSI?  Can we try it out with a commodity 
> >>>> SAS drive?
> >>> What do you mean by well supported?  The way the SCSI standard is
> >>> written, a device can do a complete cache flush when a range flush is
> >>> requested and still be fully standards compliant.  There's no easy way
> >>> to tell if it does a complete cache flush every time other than by
> >>> taking the firmware apart (or asking the manufacturer).
> >> Quite true, though wondering aloud...
> >>
> >> How difficult would it be to pass the "lower-bound" LBA to SYNCHRONIZE 
> >> CACHE, where "lower bound" is defined as the lowest sector in the range 
> >> of sectors to be flushed?
> > 
> > Actually, the implementation is designed to allow this.  The standard
> > says if the number of blocks is zero that means flush from the specified
> > LBA to the end of the device.  The sync cache we currently use has LBA 0
> > and number of blocks zero (which means flush everything).
> 
> Yeah, that feature of the spec was what got me thinking.
> 
> "difficult" was referring more to the kernel side of things...  if 
> calculating the lowest LBA of a write barrier is difficult and/or 
> CPU-consuming, the effort may not be worth it.
> 
> But if we could stick a
> 
> 	if (lba < barrier_lower_bound)
> 		barrier_lower_bound = lba;
> 
> somewhere, then pass that to SYNCHRONIZE CACHE, it could be a cheap way 
> to increase sync-cache speed.
> 
> It seems extremely unlikely that sync-cache speed would _decrease_:  for 
> flush-everything firmwares, the sync-cache speed would remain unchanged.

It's not impossible, though ... since the drive fw processor is probably
pretty slow, but yes, it should hopefully be as fast or faster than full
sync cache.
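
A minimal sketch of the bookkeeping being discussed, in Python purely for
illustration. The class and function names are made up here and nothing
below is kernel code; it only tracks the lowest LBA seen and builds the
bytes of a 10-byte SYNCHRONIZE CACHE CDB, assuming the usual layout
(opcode 0x35, 4-byte LBA, 2-byte block count). It does not talk to a device.

```python
import struct

SYNCHRONIZE_CACHE_10 = 0x35  # SCSI opcode for SYNCHRONIZE CACHE (10)


class BarrierLowerBound:
    """Hypothetical tracker for the lowest LBA written since the last flush."""

    def __init__(self):
        self.lower_bound = None  # None means no write recorded yet

    def note_write(self, lba):
        # The cheap per-write comparison from the quoted pseudocode.
        if self.lower_bound is None or lba < self.lower_bound:
            self.lower_bound = lba

    def take(self):
        # Hand out the bound for the next flush and reset the tracker.
        # With no recorded write, fall back to LBA 0 (flush everything).
        bound, self.lower_bound = self.lower_bound, None
        return 0 if bound is None else bound


def sync_cache_cdb(lba, num_blocks=0):
    """Build the 10 bytes of a SYNCHRONIZE CACHE (10) CDB.

    num_blocks == 0 means "from lba to the end of the device", so
    lba == 0 with num_blocks == 0 is the full flush used today.
    """
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("LBA does not fit in the 10-byte CDB")
    if not 0 <= num_blocks <= 0xFFFF:
        raise ValueError("block count does not fit in the 10-byte CDB")
    # opcode, flags, 4-byte LBA, group number, 2-byte count, control
    return struct.pack(">BBIBHB", SYNCHRONIZE_CACHE_10, 0, lba, 0, num_blocks, 0)
```

A flush-everything firmware would treat the resulting command exactly like
the LBA 0 case, which is the "no slower than today" argument made above.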

> >> That seems like a reasonable optimization -- it gives the drive an easy 
> >> way to skip sync'ing sectors lower than the lower-bound LBA, if it is 
> >> capable.  Otherwise, a standards-compliant firmware will behave as you 
> >> describe, and do what our code currently expects today -- a full cache 
> >> flush.
> >>
> >> This seems like a good way to speed up cache flush [on SCSI], while also 
> >> perhaps experimenting with a more fine-grained way to pass down write 
> >> barriers to the device.
> >>
> >> Not a high priority thing overall, but OTOH, consider the case of 
> >> placing your journal at the end of the disk.  You could then issue a 
> >> cache flush with a non-zero starting offset:
> >>
> >> 	SYNCHRONIZE CACHE (max sectors - JOURNAL_SIZE, ~0)
> >>
> >> That should be trivial even for dumb disk firmwares to optimize.
> > 
> > We could try it ... I'm still not sure how we'd tell the device is
> > actually implementing it and not flushing the entire device.
> 
> Is that knowledge necessary?
> 
> Assuming the lower-bound is super-cheap to calculate, then the two most 
> likely outcomes are:  sync-cache speed remains the same, or sync-cache 
> speed increases.

Yes, agreed ... if the lower bound is cheap to know, we might as well
tell the FW, whether or not it acts on it.

> If the calculation of lower-bound is costly, I could see the need for 
> that knowledge -- but if the cost is too high, the entire effort is 
> likely to be scuttled, rather than worrying about detecting 
> flush-everything firmwares.

I really think, though, it's time to look again at how we implement
barriers.  Even properly implemented range flushing (if we can do it) is
only decreasing the amount of overhead in a flush barrier.

If we could make the filesystems tolerant or at least aware that there
might be very rare periods during operation when barriers get violated
(during error processing or queue full handling) we could look again at
implementing barriers via ordered tags.

James



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 20:15                                                                                         ` david
@ 2009-04-01 21:33                                                                                           ` Andreas T.Auer
  2009-04-01 22:29                                                                                             ` david
  2009-04-01 22:00                                                                                           ` Harald Arnesen
  1 sibling, 1 reply; 664+ messages in thread
From: Andreas T.Auer @ 2009-04-01 21:33 UTC (permalink / raw)
  To: david; +Cc: Bill Davidsen, linux-kernel



On 01.04.2009 22:15 david@lang.hm wrote:
> On Wed, 1 Apr 2009, Bill Davidsen wrote:
>
>> david@lang.hm wrote:
>>> it's not that it's deliberately pushing metadata out ahead of file
>>> data, but say you have the following sequence
>>>
>>> write to file1
>>> update metadata for file1
>>> write to file2
>>> update metadata for file2
>>>
>>> if file1 and file2 are in the same directory your software can
>>> finish all four of these steps before _any_ of the data gets pushed
>>> to disk.
>>>
>>> then when the system goes to write the metadata for file1 it is
>>> pushing the then-current copy of that sector to disk, which includes
>>> the metadata for file2, even though the data for file2 hasn't been
>>> written yet.
>>>
>>> if you try to say 'flush all data blocks before metadata blocks' and
>>> have a lot of activity going on in a directory, and have to wait
>>> until it all stops before you write any of the metadata out, you
>>> could be blocked from writing the metadata for a _long_ time.
>>>
>> If you mean "write all data for that file" before the metadata, it
>> would seem to behave the way an fsync would, and the metadata should
>> go out in some reasonable time.
>
> except if another file in the directory gets modified while it's
> writing out the first two, that file now would need to get written out
> as well, before the metadata for that directory can be written. if you
> have a busy system (say a database or log server), where files are
> getting modified pretty constantly, it can be a long time before all
> the file data is written out and the system is idle enough to write
> the metadata.
Thank you, David, for this use case, but I think the problem could be
solved quite easily:

At any write-out time, e.g. after collecting enough data for delayed
allocation or at fsync()

1) copy the metadata in memory, i.e. snapshot it
2) write out the data corresponding to the metadata-snapshot
3) write out the snapshot of the metadata

In that way subsequent metadata changes should not interfere with the
metadata-update on disk.
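
The three steps can be sketched as a toy in-memory model. This is only an
illustration in Python; ToyFilesystem and its fields are invented for the
example and say nothing about how the dcache or any real filesystem is
organised.

```python
import copy


class ToyFilesystem:
    """Toy model: dirty data and metadata in memory, plus a 'disk'."""

    def __init__(self):
        self.dirty_data = {}      # name -> bytes not yet on disk
        self.metadata = {}        # name -> size recorded in memory
        self.disk_data = {}       # name -> bytes on disk
        self.disk_metadata = {}   # name -> size recorded on disk

    def write(self, name, payload):
        self.dirty_data[name] = payload
        self.metadata[name] = len(payload)

    def writeout(self, concurrent_writes=()):
        # 1) copy the metadata in memory, i.e. snapshot it
        snapshot = copy.deepcopy(self.metadata)
        pending = {n: self.dirty_data[n] for n in snapshot if n in self.dirty_data}
        # Writes arriving during the flush change the live metadata only.
        for name, payload in concurrent_writes:
            self.write(name, payload)
        # 2) write out the data corresponding to the snapshot
        for name, payload in pending.items():
            self.disk_data[name] = payload
            if self.dirty_data.get(name) is payload:
                del self.dirty_data[name]
        # 3) write out the snapshot of the metadata
        self.disk_metadata = snapshot
```

In this model a file modified during the flush simply stays dirty and is
picked up by the next writeout, so the flush itself never has to wait for
the directory to go idle.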

Andreas

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 21:03                           ` Lennart Sorensen
@ 2009-04-01 21:36                             ` Andrew Morton
  2009-04-01 22:57                               ` Lennart Sorensen
  2009-04-02  1:00                               ` Ingo Molnar
  2009-04-02 11:05                             ` Janne Grunau
  2009-04-02 12:17                             ` Theodore Tso
  2 siblings, 2 replies; 664+ messages in thread
From: Andrew Morton @ 2009-04-01 21:36 UTC (permalink / raw)
  To: Lennart Sorensen; +Cc: torvalds, tytso, drees76, jesper, linux-kernel

On Wed, 1 Apr 2009 17:03:38 -0400
lsorense@csclub.uwaterloo.ca (Lennart Sorensen) wrote:

> On Thu, Mar 26, 2009 at 06:25:19PM -0700, Andrew Morton wrote:
> > The JBD journal is a massive designed-in contention point.  It's why
> > for several years I've been telling anyone who will listen that we need
> > a new fs.  Hopefully our response to all these problems will soon be
> > "did you try btrfs?".
> 
> Oh I look forward to the day when it will be safe to convert my mythtv
> box from ext3 to btrfs.  Current kernels just have too much IO latency
> with ext3 it seems.  Older kernels were more responsive, but probably
> had other places they were less efficient.

Back in 2002ish I did a *lot* of work on IO latency, reads-vs-writes,
etc, etc (but not fsync - for practical purposes it's unfixable on
ext3-ordered)

Performance was pretty good.  From some of the descriptions I'm seeing
get tossed around lately, I suspect that it has regressed.

It would be useful/interesting if people were to rerun some of these
tests with `echo anticipatory > /sys/block/sda/queue/scheduler'.

Or with linux-2.5.60 :(
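
For anyone rerunning the tests, it helps to record which scheduler was
actually active. A small sketch in Python, relying on the sysfs file
listing the available schedulers with the active one in square brackets;
the helper names and the default device "sda" are assumptions for the
example.

```python
def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets], or None if there is none."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_scheduler(device="sda"):
    # The device name is an assumption; adjust it for the disk under test.
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())
```

Logging read_scheduler() before and after the echo makes it obvious
whether the switch took effect for a given run.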


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 16:02                                                                                       ` Chris Mason
  2009-04-01 18:37                                                                                         ` Andreas T.Auer
@ 2009-04-01 21:50                                                                                         ` Theodore Tso
  2009-04-01 23:44                                                                                           ` Matthew Garrett
  1 sibling, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-04-01 21:50 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andreas T.Auer, Dave Chinner, Mark Lord, Stefan Richter,
	Jeff Garzik, Linus Torvalds, Matthew Garrett, Alan Cox,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, Apr 01, 2009 at 12:02:26PM -0400, Chris Mason wrote:
> 
> If we crash during the rsync, the backup logs will yell.  If we crash
> just after the rsync, the backup logs won't know.  The data could still
> be gone.

So have rsync call the sync() system call before it exits.  Not a big
deal, and not all that costly.  So basically what I would suggest
doing for people who are really worried about rsync performance with
flush-on-rename is to create a patch to rsync which creates a new
flag, --unlink-before-rename, which will defeat the flush-on-rename
heuristic; and if this patch also causes rsync to call sync() when it
is done, it should be quite safe.
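
Roughly what such a patch would do, sketched in Python rather than as an
actual rsync change. The --unlink-before-rename flag is only proposed
here and does not exist; the function names are invented for the example.

```python
import os


def replace_unlink_first(tmp_path, dst_path):
    """Install tmp_path as dst_path, unlinking the destination first.

    Since dst_path no longer exists when rename() runs, the rename does
    not replace an existing file. The price is a window in which
    dst_path is missing altogether.
    """
    try:
        os.unlink(dst_path)
    except FileNotFoundError:
        pass  # nothing to replace
    os.rename(tmp_path, dst_path)


def finish_transfer():
    # A single sync() once every file has been installed (Unix only).
    os.sync()
```

The trade-off is explicit: the per-file flush is skipped, and safety then
depends entirely on the final sync() having completed.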

						- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 20:15                                                                                         ` david
  2009-04-01 21:33                                                                                           ` Andreas T.Auer
@ 2009-04-01 22:00                                                                                           ` Harald Arnesen
  2009-04-01 22:09                                                                                             ` Alejandro Riveira Fernández
  2009-04-01 22:28                                                                                             ` david
  1 sibling, 2 replies; 664+ messages in thread
From: Harald Arnesen @ 2009-04-01 22:00 UTC (permalink / raw)
  To: david; +Cc: Bill Davidsen, linux-kernel

david@lang.hm writes:

>> Understood that it's not deliberate just careless. The two behaviors
>> which are reported are (a) updating a record in an existing file and
>> having the entire file content vanish, and (b) finding some one
>> else's old data in my file - a serious security issue. I haven't
>> seen any report of the case where a process unlinks or truncates a
>> file, the disk space gets reused, and then the system fails before
>> the metadata is updated, leaving the data written by some other
>> process in the file where it can be read - another possible security
>> issue.
>
> ext3 eliminates this security issue by writing the data before the
> metadata. ext4 (and I think XFS) eliminate this security issue by not
> allocating the blocks until it goes to write the data out. I don't
> know how other filesystems deal with this.

I've been wondering about that during the last days. How about JFS and
data loss (files containing zeroes after a crash), as compared to ext3,
ext4, ordered and writeback journal modes? Is it safe?
-- 
Hilsen Harald.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 22:00                                                                                           ` Harald Arnesen
@ 2009-04-01 22:09                                                                                             ` Alejandro Riveira Fernández
  2009-04-01 22:28                                                                                             ` david
  1 sibling, 0 replies; 664+ messages in thread
From: Alejandro Riveira Fernández @ 2009-04-01 22:09 UTC (permalink / raw)
  To: Harald Arnesen; +Cc: david, Bill Davidsen, linux-kernel

El Thu, 02 Apr 2009 00:00:04 +0200
Harald Arnesen <skogtun.harald@gmail.com> escribió:


> 
> I've been wondering about that during the last days. How about JFS and
> data loss (files containing zeroes after a crash), as compared to ext3,
> ext4, ordered and writeback journal modes? Is it safe?

  i have had zeroed conf files with jfs (shell history) and corrupted firefox
 history files too after power outages and the like.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 22:00                                                                                           ` Harald Arnesen
  2009-04-01 22:09                                                                                             ` Alejandro Riveira Fernández
@ 2009-04-01 22:28                                                                                             ` david
  1 sibling, 0 replies; 664+ messages in thread
From: david @ 2009-04-01 22:28 UTC (permalink / raw)
  To: Harald Arnesen; +Cc: Bill Davidsen, linux-kernel

On Thu, 2 Apr 2009, Harald Arnesen wrote:

> david@lang.hm writes:
>
>>> Understood that it's not deliberate just careless. The two behaviors
>>> which are reported are (a) updating a record in an existing file and
>>> having the entire file content vanish, and (b) finding some one
>>> else's old data in my file - a serious security issue. I haven't
>>> seen any report of the case where a process unlinks or truncates a
>>> file, the disk space gets reused, and then the system fails before
>>> the metadata is updated, leaving the data written by some other
>>> process in the file where it can be read - another possible security
>>> issue.
>>
>> ext3 eliminates this security issue by writing the data before the
>> metadata. ext4 (and I think XFS) eliminate this security issue by not
>> allocating the blocks until it goes to write the data out. I don't
>> know how other filesystems deal with this.
>
> I've been wondering about that during the last days. How about JFS and
> data loss (files containing zeroes after a crash), as compared to ext3,
> ext4, ordered and writeback journal modes? Is it safe?

if you don't do a fsync you can (and will) lose data if there is a crash

period, end of statement, with all filesystems

for all filesystems except ext3 in data=ordered or data=journal modes 
journaling does _not_ mean that your files will have valid data in them. 
all it means is that your metadata will not be inconsistent (things like 
one block on disk showing up as being part of two different files)

this guarantee means that a crash is not likely to scramble your entire 
disk, but any data written shortly before the crash may not have made it 
to disk (and the files may contain garbage in the space that was allocated 
but not written). as such it is not necessary to do a fsck after every 
crash (it's still a good idea to do so every once in a while)

that's _ALL_ that journaling is protecting you from.

delayed allocation and data=ordered are ways to address the security 
problem that the garbage data that could end up as part of the file could 
contain sensitive data that had been part of other files in the past.

data=ordered and data=journal address this security risk by writing the 
data before they write the metadata (at the cost of long delays in writing 
the metadata out, and therefore long fsync times)

XFS and ext4 solve the problem by not allocating the data blocks until 
they are actually ready to write the data.
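
for an application that cares about its data, the usual pattern is to
write a new copy, fsync it, and only then rename it into place. a minimal
sketch in Python; the function name and the ".tmp" suffix are made up for
the example, and fsync on a directory opened read-only is assumed to work
as it does on Linux.

```python
import os


def durable_replace(path, data):
    """Replace path so that a crash leaves either the old or the new data."""
    tmp = path + ".tmp"  # hypothetical temporary name
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested
            view = view[os.write(fd, view):]
        os.fsync(fd)  # file data reaches the disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)
    # fsync the directory so the rename itself is on disk too
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

this costs one fsync per update, which is exactly the cost being argued
about in this thread when the filesystem is ext3 with data=ordered.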

David Lang


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 21:33                                                                                           ` Andreas T.Auer
@ 2009-04-01 22:29                                                                                             ` david
  2009-04-02  2:30                                                                                               ` Bron Gondwana
  2009-04-02 12:30                                                                                               ` Bill Davidsen
  0 siblings, 2 replies; 664+ messages in thread
From: david @ 2009-04-01 22:29 UTC (permalink / raw)
  To: Andreas T.Auer; +Cc: Bill Davidsen, linux-kernel

On Wed, 1 Apr 2009, Andreas T.Auer wrote:

> On 01.04.2009 22:15 david@lang.hm wrote:
>> On Wed, 1 Apr 2009, Bill Davidsen wrote:
>>
>>> david@lang.hm wrote:
>>>> it's not that it's deliberately pushing metadata out ahead of file
>>>> data, but say you have the following sequence
>>>>
>>>> write to file1
>>>> update metadata for file1
>>>> write to file2
>>>> update metadata for file2
>>>>
>>>> if file1 and file2 are in the same directory your software can
>>>> finish all four of these steps before _any_ of the data gets pushed
>>>> to disk.
>>>>
>>>> then when the system goes to write the metadata for file1 it is
>>>> pushing the then-current copy of that sector to disk, which includes
>>>> the metadata for file2, even though the data for file2 hasn't been
>>>> written yet.
>>>>
>>>> if you try to say 'flush all data blocks before metadata blocks' and
>>>> have a lot of activity going on in a directory, and have to wait
>>>> until it all stops before you write any of the metadata out, you
>>>> could be blocked from writing the metadata for a _long_ time.
>>>>
>>> If you mean "write all data for that file" before the metadata, it
>>> would seem to behave the way an fsync would, and the metadata should
>>> go out in some reasonable time.
>>
>> except if another file in the directory gets modified while it's
>> writing out the first two, that file now would need to get written out
>> as well, before the metadata for that directory can be written. if you
>> have a busy system (say a database or log server), where files are
>> getting modified pretty constantly, it can be a long time before all
>> the file data is written out and the system is idle enough to write
>> the metadata.
> Thank you, David, for this use case, but I think the problem could be
> solved quite easily:
>
> At any write-out time, e.g. after collecting enough data for delayed
> allocation or at fsync()
>
> 1) copy the metadata in memory, i.e. snapshot it
> 2) write out the data corresponding to the metadata-snapshot
> 3) write out the snapshot of the metadata
>
> In that way subsequent metadata changes should not interfere with the
> metadata-update on disk.

the problem with this approach is that the dcache has no provision for 
there being two (or more) copies of the disk block in its cache, adding 
this would significantly complicate things (it was mentioned briefly a few 
days ago in this thread)

David Lang

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 21:36                             ` Andrew Morton
@ 2009-04-01 22:57                               ` Lennart Sorensen
  2009-04-03 14:46                                 ` Mark Lord
  2009-04-02  1:00                               ` Ingo Molnar
  1 sibling, 1 reply; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-01 22:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, tytso, drees76, jesper, linux-kernel

On Wed, Apr 01, 2009 at 02:36:22PM -0700, Andrew Morton wrote:
> Back in 2002ish I did a *lot* of work on IO latency, reads-vs-writes,
> etc, etc (but not fsync - for practical purposes it's unfixable on
> ext3-ordered)
> 
> Performance was pretty good.  From some of the descriptions I'm seeing
> get tossed around lately, I suspect that it has regressed.
> 
> It would be useful/interesting if people were to rerun some of these
> tests with `echo anticipatory > /sys/block/sda/queue/scheduler'.
> 
> Or with linux-2.5.60 :(

Well 2.6.18 seems to keep popping up as the last kernel with "sane"
behaviour, at least in terms of not causing huge delays under many
workloads.  I currently run 2.6.26, although that could be updated as
soon as I get around to figuring out why lirc isn't working for me when
I move past 2.6.26.

I could certainly try changing the scheduler on my mythtv box and seeing
if that makes any difference to the behaviour.  It is pretty darn obvious
whether it is responsive or not when starting to play back a video.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 21:50                                                                                         ` Theodore Tso
@ 2009-04-01 23:44                                                                                           ` Matthew Garrett
  0 siblings, 0 replies; 664+ messages in thread
From: Matthew Garrett @ 2009-04-01 23:44 UTC (permalink / raw)
  To: Theodore Tso, Chris Mason, Andreas T.Auer, Dave Chinner,
	Mark Lord, Stefan Richter, Jeff Garzik, Linus Torvalds, Alan Cox,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, Apr 01, 2009 at 05:50:40PM -0400, Theodore Tso wrote:
> On Wed, Apr 01, 2009 at 12:02:26PM -0400, Chris Mason wrote:
> > 
> > If we crash during the rsync, the backup logs will yell.  If we crash
> > just after the rsync, the backup logs won't know.  The data could still
> > be gone.
> 
> So have rsync call the sync() system call before it exits.

sync() isn't guaranteed to be synchronous. Treating it as such isn't 
portable.
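
A portable alternative is to fsync() what was written, since fsync() does
not return until the data has been transferred. A rough sketch in Python;
the function names are invented for the example, and fsync on a file or
directory opened read-only is an assumption that holds on Linux but may
not hold everywhere.

```python
import os


def fsync_path(path):
    """Open path read-only and fsync it (file or directory)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def fsync_tree(root):
    """fsync every regular file and directory under root; return the count."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files hold data.
            if os.path.isfile(full) and not os.path.islink(full):
                fsync_path(full)
                count += 1
        fsync_path(dirpath)  # makes the directory entries durable as well
        count += 1
    return count
```

This is far more expensive than one sync() on a large tree, which is the
performance concern that started this sub-thread.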
-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 21:36                             ` Andrew Morton
  2009-04-01 22:57                               ` Lennart Sorensen
@ 2009-04-02  1:00                               ` Ingo Molnar
  2009-04-03  4:06                                 ` Lennart Sorensen
  1 sibling, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-04-02  1:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lennart Sorensen, torvalds, tytso, drees76, jesper, linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 1 Apr 2009 17:03:38 -0400
> lsorense@csclub.uwaterloo.ca (Lennart Sorensen) wrote:
> 
> > On Thu, Mar 26, 2009 at 06:25:19PM -0700, Andrew Morton wrote:
> > > The JBD journal is a massive designed-in contention point.  It's why
> > > for several years I've been telling anyone who will listen that we need
> > > a new fs.  Hopefully our response to all these problems will soon be
> > > "did you try btrfs?".
> > 
> > Oh I look forward to the day when it will be safe to convert my mythtv
> > box from ext3 to btrfs.  Current kernels just have too much IO latency
> > with ext3 it seems.  Older kernels were more responsive, but probably
> > had other places they were less efficient.
> 
> Back in 2002ish I did a *lot* of work on IO latency, 
> reads-vs-writes, etc, etc (but not fsync - for practical purposes 
> it's unfixable on ext3-ordered)
> 
> Performance was pretty good.  From some of the descriptions I'm 
> seeing get tossed around lately, I suspect that it has regressed.
> 
> It would be useful/interesting if people were to rerun some of these
> tests with `echo anticipatory > /sys/block/sda/queue/scheduler'.

I'll test this (and the other suggestions) once i'm out of the merge 
window.

> Or with linux-2.5.60 :(

I probably won't test that though ;-)

Going back to v2.6.14 to do pre-mutex-merge performance tests was 
already quite a challenge on modern hardware.

	Ingo

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-01 22:29                                                                                             ` david
@ 2009-04-02  2:30                                                                                               ` Bron Gondwana
  2009-04-02  4:55                                                                                                 ` david
  2009-04-02 12:30                                                                                               ` Bill Davidsen
  1 sibling, 1 reply; 664+ messages in thread
From: Bron Gondwana @ 2009-04-02  2:30 UTC (permalink / raw)
  To: david; +Cc: Andreas T.Auer, Bill Davidsen, linux-kernel

On Wed, Apr 01, 2009 at 03:29:29PM -0700, david@lang.hm wrote:
> On Wed, 1 Apr 2009, Andreas T.Auer wrote:
>> On 01.04.2009 22:15 david@lang.hm wrote:
>>> except if another file in the directory gets modified while it's
>>> writing out the first two, that file now would need to get written out
>>> as well, before the metadata for that directory can be written. if you
>>> have a busy system (say a database or log server), where files are
>>> getting modified pretty constantly, it can be a long time before all
>>> the file data is written out and the system is idle enough to write
>>> the metadata.
>> Thank you, David, for this use case, but I think the problem could be
>> solved quite easily:
>>
>> At any write-out time, e.g. after collecting enough data for delayed
>> allocation or at fsync()
>>
>> 1) copy the metadata in memory, i.e. snapshot it
>> 2) write out the data corresponding to the metadata-snapshot
>> 3) write out the snapshot of the metadata
>>
>> In that way subsequent metadata changes should not interfere with the
>> metadata-update on disk.
>
> the problem with this approach is that the dcache has no provision for  
> there being two (or more) copies of the disk block in its cache, adding  
> this would significantly complicate things (it was mentioned briefly a 
> few days ago in this thread)

It seems that it's obviously the "right way" to solve the problem
though.  How much does the dcache need to know about this "in flight"
block (ok, blocks - I can imagine a pathological case where there
were a stack of them all slightly different in the queue)?

You'd be basically reinventing MVCC-like database logic with
transactional commits at that point - so each fs "barrier" call
would COW all the affected pages and write them down to disk.

Bron.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-02  2:30                                                                                               ` Bron Gondwana
@ 2009-04-02  4:55                                                                                                 ` david
  2009-04-02  5:29                                                                                                   ` Bron Gondwana
  2009-04-02  9:58                                                                                                   ` Andreas T.Auer
  0 siblings, 2 replies; 664+ messages in thread
From: david @ 2009-04-02  4:55 UTC (permalink / raw)
  To: Bron Gondwana; +Cc: Andreas T.Auer, Bill Davidsen, linux-kernel

On Thu, 2 Apr 2009, Bron Gondwana wrote:

> On Wed, Apr 01, 2009 at 03:29:29PM -0700, david@lang.hm wrote:
>> On Wed, 1 Apr 2009, Andreas T.Auer wrote:
>>> On 01.04.2009 22:15 david@lang.hm wrote:
>>>> except if another file in the directory gets modified while it's
>>>> writing out the first two, that file now would need to get written out
>>>> as well, before the metadata for that directory can be written. if you
>>>> have a busy system (say a database or log server), where files are
>>>> getting modified pretty constantly, it can be a long time before all
>>>> the file data is written out and the system is idle enough to write
>>>> the metadata.
>>> Thank you, David, for this use case, but I think the problem could be
>>> solved quite easily:
>>>
>>> At any write-out time, e.g. after collecting enough data for delayed
>>> allocation or at fsync()
>>>
>>> 1) copy the metadata in memory, i.e. snapshot it
>>> 2) write out the data corresponding to the metadata-snapshot
>>> 3) write out the snapshot of the metadata
>>>
>>> In that way subsequent metadata changes should not interfere with the
>>> metadata-update on disk.
>>
>> the problem with this approach is that the dcache has no provision for
>> there being two (or more) copies of the disk block in its cache, adding
>> this would significantly complicate things (it was mentioned briefly a
>> few days ago in this thread)
>
> It seems that it's obviously the "right way" to solve the problem
> though.  How much does the dcache need to know about this "in flight"
> block (ok, blocks - I can imagine a pathological case where there
> were a stack of them all slightly different in the queue)?

but if only one filesystem needs this capability is it really worth 
complicating the dcache for the entire system?

> You'd be basically reinventing MVCC-like database logic with
> transactional commits at that point - so each fs "barrier" call
> would COW all the affected pages and write them down to disk.

one aspect of mvcc systems is that they eat up space and require 'garbage 
collection' type functions. that could cause deadlocks if you aren't 
careful.

David Lang

* Re: Linux 2.6.29
  2009-04-02  4:55                                                                                                 ` david
@ 2009-04-02  5:29                                                                                                   ` Bron Gondwana
  2009-04-02  9:58                                                                                                   ` Andreas T.Auer
  1 sibling, 0 replies; 664+ messages in thread
From: Bron Gondwana @ 2009-04-02  5:29 UTC (permalink / raw)
  To: david; +Cc: Bron Gondwana, Andreas T.Auer, Bill Davidsen, linux-kernel

On Wed, Apr 01, 2009 at 09:55:18PM -0700, david@lang.hm wrote:
> On Thu, 2 Apr 2009, Bron Gondwana wrote:
>
>> On Wed, Apr 01, 2009 at 03:29:29PM -0700, david@lang.hm wrote:
>>> the problem with this approach is that the dcache has no provision for
>>> there being two (or more) copies of the disk block in its cache, adding
>>> this would significantly complicate things (it was mentioned briefly a
>>> few days ago in this thread)
>>
>> It seems that it's obviously the "right way" to solve the problem
>> though.  How much does the dcache need to know about this "in flight"
>> block (ok, blocks - I can imagine a pathological case where there
>> were a stack of them all slightly different in the queue)?
>
> but if only one filesystem needs this capability is it really worth
> complicating the dcache for the entire system?

Depends if that one filesystem is expected to have 90% of the
installed base or not, I guess.  If not, then it's not worth
it.  If having something like this makes that one filesystem
the best for the majority of workloads, then hell yes.

>> You'd be basically reinventing MVCC-like database logic with
>> transactional commits at that point - so each fs "barrier" call
>> would COW all the affected pages and write them down to disk.
>
> one aspect of mvcc systems is that they eat up space and require 'garbage 
> collection' type functions. that could cause deadlocks if you aren't  
> careful.

I guess the nice thing here is that the only consumer for the older
versions is the disk flushing thread, so figuring out when to cleanup
wouldn't be so hard as in a concurrent-users database.

But I'm speculating with little hands-on experience with the
code.  I just know I'd like the result...

Bron ( creating consistent pages on disk that never really
       existed in memory sounds... exciting )

* Re: Linux 2.6.29
  2009-04-02  4:55                                                                                                 ` david
  2009-04-02  5:29                                                                                                   ` Bron Gondwana
@ 2009-04-02  9:58                                                                                                   ` Andreas T.Auer
  1 sibling, 0 replies; 664+ messages in thread
From: Andreas T.Auer @ 2009-04-02  9:58 UTC (permalink / raw)
  To: david; +Cc: Bron Gondwana, Andreas T.Auer, Bill Davidsen, linux-kernel



On 02.04.2009 06:55 david@lang.hm wrote:
> On Thu, 2 Apr 2009, Bron Gondwana wrote:
>
>> On Wed, Apr 01, 2009 at 03:29:29PM -0700, david@lang.hm wrote:
>>> On Wed, 1 Apr 2009, Andreas T.Auer wrote:
>>>> On 01.04.2009 22:15 david@lang.hm wrote:
>>>>> except if another file in the directory gets modified while it's
>>>>> writing out the first two, that file now would need to get written out
>>>>> as well, before the metadata for that directory can be written. if you
>>>>> have a busy system (say a database or log server), where files are
>>>>> getting modified pretty constantly, it can be a long time before all
>>>>> the file data is written out and the system is idle enough to write
>>>>> the metadata.
>>>> Thank you, David, for this use case, but I think the problem could be
>>>> solved quite easily:
>>>>
>>>> At any write-out time, e.g. after collecting enough data for delayed
>>>> allocation or at fsync()
>>>>
>>>> 1) copy the metadata in memory, i.e. snapshot it
>>>> 2) write out the data corresponding to the metadata-snapshot
>>>> 3) write out the snapshot of the metadata
>>>>
>>>> In that way subsequent metadata changes should not interfere with the
>>>> metadata-update on disk.
>>>
>>> the problem with this approach is that the dcache has no provision for
>>> there being two (or more) copies of the disk block in its cache, adding
>>> this would significantly complicate things (it was mentioned briefly a
>>> few days ago in this thread)

I must have missed that message and can't find it.

>> It seems that it's obviously the "right way" to solve the problem
>> though.  How much does the dcache need to know about this "in flight"
>> block (ok, blocks - I can imagine a pathological case where there
>> were a stack of them all slightly different in the queue)?
>
> but if only one filesystem needs this capability is it really worth
> complicating the dcache for the entire system?

No, it's not necessary. It should be possible for the specific fs to
keep the metadata copy internally. And as long as these blocks are
written immediately after writing the data, there should be no "queue"
of copies, depending on how fsyncs are handled while the fs is
committing. There might be one copy for the current commit and (at most)
one copy corresponding to the most recent pending fsync. If there are
multiple fsyncs before the commit is finished, the "pending copy" could
simply be overwritten.
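
The two-slot idea above could be sketched roughly as follows. This is a toy user-space model and not code from any real filesystem: `struct meta_commit`, `META_SIZE`, `snapshot_on_fsync()` and `commit_finished()` are hypothetical names made up for the illustration, and the data writeout that would precede each metadata write is left out.

```c
#include <stdbool.h>
#include <string.h>

#define META_SIZE 64	/* hypothetical size of one metadata block */

struct meta_commit {
	char committing[META_SIZE];	/* snapshot the current commit is writing */
	char pending[META_SIZE];	/* snapshot for the most recent waiting fsync */
	bool commit_active;
	bool pending_valid;
};

/* Called at fsync()/write-out time: snapshot the live metadata. */
static void snapshot_on_fsync(struct meta_commit *mc, const char *live)
{
	if (!mc->commit_active) {
		/* No commit running: this snapshot becomes the current commit. */
		memcpy(mc->committing, live, META_SIZE);
		mc->commit_active = true;
	} else {
		/* A commit is running: keep (at most) one pending copy and
		 * simply overwrite it if an older one is still waiting. */
		memcpy(mc->pending, live, META_SIZE);
		mc->pending_valid = true;
	}
}

/* Called when the current commit has reached the disk. */
static void commit_finished(struct meta_commit *mc)
{
	if (mc->pending_valid) {
		/* Promote the pending snapshot; it is the next commit. */
		memcpy(mc->committing, mc->pending, META_SIZE);
		mc->pending_valid = false;
	} else {
		mc->commit_active = false;
	}
}
```

However many fsyncs arrive during one commit, the model never holds more than two copies, which is the "no queue" property described above.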

Andreas

* Re: Linux 2.6.29
  2009-04-01 21:03                           ` Lennart Sorensen
  2009-04-01 21:36                             ` Andrew Morton
@ 2009-04-02 11:05                             ` Janne Grunau
  2009-04-02 16:09                               ` Andrew Morton
                                                 ` (3 more replies)
  2009-04-02 12:17                             ` Theodore Tso
  2 siblings, 4 replies; 664+ messages in thread
From: Janne Grunau @ 2009-04-02 11:05 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Andrew Morton, Linus Torvalds, Theodore Tso, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Wed, Apr 01, 2009 at 05:03:38PM -0400, Lennart Sorensen wrote:
> On Thu, Mar 26, 2009 at 06:25:19PM -0700, Andrew Morton wrote:
> > The JBD journal is a massive designed-in contention point.  It's why
> > for several years I've been telling anyone who will listen that we need
> > a new fs.  Hopefully our response to all these problems will soon be
> > "did you try btrfs?".
> 
> Oh I look forward to the day when it will be safe to convert my mythtv
> box from ext3 to btrfs.

You could convert it to xfs now. xfs is probably the file system with
the lowest complaints-to-usage ratio within the MythTV community.
Using distinct discs for system and recording storage helps too.

> Current kernels just have too much IO latency
> with ext3 it seems.

MythTV calls fsync every few seconds on ongoing recordings to prevent
stalls due to large cache writebacks on ext3.

cheers

Janne
(MythTV developer)

* Re: Linux 2.6.29
  2009-04-01 21:03                           ` Lennart Sorensen
  2009-04-01 21:36                             ` Andrew Morton
  2009-04-02 11:05                             ` Janne Grunau
@ 2009-04-02 12:17                             ` Theodore Tso
  2009-04-02 21:54                               ` Lennart Sorensen
  2 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-04-02 12:17 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Andrew Morton, Linus Torvalds, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Wed, Apr 01, 2009 at 05:03:38PM -0400, Lennart Sorensen wrote:
> On Thu, Mar 26, 2009 at 06:25:19PM -0700, Andrew Morton wrote:
> > The JBD journal is a massive designed-in contention point.  It's why
> > for several years I've been telling anyone who will listen that we need
> > a new fs.  Hopefully our response to all these problems will soon be
> > "did you try btrfs?".
> 
> Oh I look forward to the day when it will be safe to convert my mythtv
> box from ext3 to btrfs.  Current kernels just have too much IO latency
> with ext3 it seems.  Older kernels were more responsive, but probably
> had other places they were less efficient.

Well, ext4 will be an interim solution you can convert to first.  It
will be best with a backup/reformat/restore pass, better if you enable
extents (at least for new files, but then you won't be able to go back
to ext3), but you'll get improvements even if you just mount an ext3
filesystem as ext4.

						- Ted

* Re: Linux 2.6.29
  2009-04-01 22:29                                                                                             ` david
  2009-04-02  2:30                                                                                               ` Bron Gondwana
@ 2009-04-02 12:30                                                                                               ` Bill Davidsen
  1 sibling, 0 replies; 664+ messages in thread
From: Bill Davidsen @ 2009-04-02 12:30 UTC (permalink / raw)
  To: david; +Cc: Andreas T.Auer, linux-kernel

david@lang.hm wrote:
> On Wed, 1 Apr 2009, Andreas T.Auer wrote:
>> Thank you, David, for this use case, but I think the problem could be
>> solved quite easily:
>>
>> At any write-out time, e.g. after collecting enough data for delayed
>> allocation or at fsync()
>>
>> 1) copy the metadata in memory, i.e. snapshot it
>> 2) write out the data corresponding to the metadata-snapshot
>> 3) write out the snapshot of the metadata
>>
>> In that way subsequent metadata changes should not interfere with the
>> metadata-update on disk.
>
> the problem with this approach is that the dcache has no provision for 
> there being two (or more) copies of the disk block in its cache, 
> adding this would significantly complicate things (it was mentioned 
> briefly a few days ago in this thread)

I think the sync point should be between the file system and the dcache, 
with the data only going into the dcache when it's time to write it. 
That also opens the door to doing atime better at no cost, atime changes 
would be kept internal to the file system, and only be written at close 
or fsync, even on a mount which does not use noatime or relatime. The 
file system can keep that information and only write it when appropriate.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.



* Re: Linux 2.6.29
  2009-03-24  6:19 ` Jesper Krogh
  2009-03-24  6:46   ` David Rees
@ 2009-04-02 14:00   ` Mathieu Desnoyers
  1 sibling, 0 replies; 664+ messages in thread
From: Mathieu Desnoyers @ 2009-04-02 14:00 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Linus Torvalds, Linux Kernel Mailing List, Theodore Tso,
	Ingo Molnar, David Rees, Alan Cox

> 
> Linus Torvalds wrote:
> > This obviously starts the merge window for 2.6.30, although as usual, I'll 
> > probably wait a day or two before I start actively merging. I do that in 
> > order to hopefully result in people testing the final plain 2.6.29 a bit 
> > more before all the crazy changes start up again.
> 
> I know this has been discussed before:
> 
> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480 seconds.
> [129402.084667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [129402.179331] updatedb.mloc D 0000000000000000     0 31092  31091
> [129402.179335]  ffff8805ffa1d900 0000000000000082 ffff8803ff5688a8 0000000000001000
> [129402.179338]  ffffffff806cc000 ffffffff806cc000 ffffffff806d3e80 ffffffff806d3e80
> [129402.179341]  ffffffff806cfe40 ffffffff806d3e80 ffff8801fb9f87e0 000000000000ffff
> [129402.179343] Call Trace:
> [129402.179353]  [<ffffffff802d3ff0>] sync_buffer+0x0/0x50
> [129402.179358]  [<ffffffff80493a50>] io_schedule+0x20/0x30
> [129402.179360]  [<ffffffff802d402b>] sync_buffer+0x3b/0x50
> [129402.179362]  [<ffffffff80493d2f>] __wait_on_bit+0x4f/0x80
> [129402.179364]  [<ffffffff802d3ff0>] sync_buffer+0x0/0x50
> [129402.179366]  [<ffffffff80493dda>] out_of_line_wait_on_bit+0x7a/0xa0
> [129402.179369]  [<ffffffff80252730>] wake_bit_function+0x0/0x30
> [129402.179396]  [<ffffffffa0264346>] ext3_find_entry+0xf6/0x610 [ext3]
> [129402.179399]  [<ffffffff802d3453>] __find_get_block+0x83/0x170
> [129402.179403]  [<ffffffff802c4a90>] ifind_fast+0x50/0xa0
> [129402.179405]  [<ffffffff802c5874>] iget_locked+0x44/0x180
> [129402.179412]  [<ffffffffa0266435>] ext3_lookup+0x55/0x100 [ext3]
> [129402.179415]  [<ffffffff802c32a7>] d_alloc+0x127/0x1c0
> [129402.179417]  [<ffffffff802ba2a7>] do_lookup+0x1b7/0x250
> [129402.179419]  [<ffffffff802bc51d>] __link_path_walk+0x76d/0xd60
> [129402.179421]  [<ffffffff802ba17f>] do_lookup+0x8f/0x250
> [129402.179424]  [<ffffffff802c8b37>] mntput_no_expire+0x27/0x150
> [129402.179426]  [<ffffffff802bcb64>] path_walk+0x54/0xb0
> [129402.179428]  [<ffffffff802bfd10>] filldir+0x0/0xf0
> [129402.179430]  [<ffffffff802bcc8a>] do_path_lookup+0x7a/0x150
> [129402.179432]  [<ffffffff802bbb55>] getname+0xe5/0x1f0
> [129402.179434]  [<ffffffff802bd8d4>] user_path_at+0x44/0x80
> [129402.179437]  [<ffffffff802b53b5>] cp_new_stat+0xe5/0x100
> [129402.179440]  [<ffffffff802b56d0>] vfs_lstat_fd+0x20/0x60
> [129402.179442]  [<ffffffff802b5737>] sys_newlstat+0x27/0x50
> [129402.179445]  [<ffffffff8020c35b>] system_call_fastpath+0x16/0x1b
> Consensus seems to be something with large memory machines, lots of 
> dirty pages and a long writeout time due to ext3.
> 
> At the moment this is the largest "usability" issue in the server setup I'm 
> working with. Can something be done to "autotune" it .. or perhaps 
> even fix it? .. or is the answer just to shift to xfs or wait for ext4?
> 

Hi Jesper,

What you are seeing looks awfully like the bug I have spent some time
trying to figure out in this bugzilla thread:

[Bug 12309] Large I/O operations result in slow performance and high
            iowait times
http://bugzilla.kernel.org/show_bug.cgi?id=12309

I created a fio test case out of a lttng trace to reproduce the problem
and created a patch to try to account the pages used by the i/o elevator
in the vm page count used to calculate memory pressure. Basically, the
behavior I was seeing is a constant increase of memory usage when doing
a dd-like write to disk until the memory fills up, which is indeed
wrong. The patch I posted in that thread seems to cause other problems
though, so probably we should teach kjournald to do better.

Here is the patch attempt :
http://bugzilla.kernel.org/attachment.cgi?id=20172

Here is the fio test case :
http://bugzilla.kernel.org/attachment.cgi?id=19894

My findings were this (I hope other people with deeper knowledge of
block layer/vm interaction can correct me) :

- Upon heavy and long disk writes, the pages used to back the buffers
  continuously increase as if there was no memory pressure at all.
  Therefore, I suspect they are held in a nowhere land that's unaccounted
  for at the vm layer (not part of memory pressure). That would seem to
  be the I/O elevator.

Can you try the dd and fio test cases pointed out in the
bugzilla entry? You may also want to see if my patch helps to partially
solve your problem. Another hint is to try to use cgroups to
restrict your heavy I/O processes to a limited amount of memory;
although it does not solve the core of the problem, it made it disappear
for me. And of course trying to get a LTTng trace to get your head
around the problem can be very efficient. It's available as a git tree
over 2.6.29, and includes VFS, block I/O layer and vm instrumentation,
which helps looking at their interaction. All information is at
http://www.lttng.org.

Hoping this helps,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: Linux 2.6.29
  2009-04-02 11:05                             ` Janne Grunau
@ 2009-04-02 16:09                               ` Andrew Morton
  2009-04-02 16:33                                 ` David Rees
  2009-04-02 16:29                               ` David Rees
                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-04-02 16:09 UTC (permalink / raw)
  To: Janne Grunau
  Cc: Lennart Sorensen, Linus Torvalds, Theodore Tso, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, 2 Apr 2009 13:05:32 +0200 Janne Grunau <j@jannau.net> wrote:

> MythTV calls fsync every few seconds on ongoing recordings to prevent
> stalls due to large cache writebacks on ext3.

It should use sync_file_range(SYNC_FILE_RANGE_WRITE).  That will

- have minimum latency.  It tries to avoid blocking at all.

- avoid writing metadata

- avoid syncing other unrelated files within ext3

- avoid waiting for the ext3 commit to complete.


* Re: Linux 2.6.29
  2009-04-02 11:05                             ` Janne Grunau
  2009-04-02 16:09                               ` Andrew Morton
@ 2009-04-02 16:29                               ` David Rees
  2009-04-02 16:42                                 ` Andrew Morton
  2009-04-02 21:42                                 ` Theodore Tso
  2009-04-02 21:50                               ` Lennart Sorensen
  2009-04-03 15:07                               ` Mark Lord
  3 siblings, 2 replies; 664+ messages in thread
From: David Rees @ 2009-04-02 16:29 UTC (permalink / raw)
  To: Janne Grunau
  Cc: Lennart Sorensen, Andrew Morton, Linus Torvalds, Theodore Tso,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 4:05 AM, Janne Grunau <j@jannau.net> wrote:
>> Current kernels just have too much IO latency
>> with ext3 it seems.
>
> MythTV calls fsync every few seconds on ongoing recordings to prevent
> stalls due to large cache writebacks on ext3.

Personally that is also one of my MythTV pet peeves.  A hack added to
MythTV to work around a crappy ext3 latency bug that also causes these
large files to get heavily fragmented.  That and the fact that you have
to patch MythTV to eliminate those forced fdatasyncs - there is no
knob to turn it off if you're running MythTV on a filesystem which
doesn't suffer from ext3's data=ordered fsync stalls.

-Dave

* Re: Linux 2.6.29
  2009-04-02 16:09                               ` Andrew Morton
@ 2009-04-02 16:33                                 ` David Rees
  2009-04-02 16:46                                   ` Linus Torvalds
                                                     ` (2 more replies)
  0 siblings, 3 replies; 664+ messages in thread
From: David Rees @ 2009-04-02 16:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Janne Grunau, Lennart Sorensen, Linus Torvalds, Theodore Tso,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 9:09 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 2 Apr 2009 13:05:32 +0200 Janne Grunau <j@jannau.net> wrote:
>> MythTV calls fsync every few seconds on ongoing recordings to prevent
>> stalls due to large cache writebacks on ext3.
>
> It should use sync_file_range(SYNC_FILE_RANGE_WRITE).  That will
>
> - have minimum latency.  It tries to avoid blocking at all.
> - avoid writing metadata
> - avoid syncing other unrelated files within ext3
> - avoid waiting for the ext3 commit to complete.

MythTV actually uses fdatasync, not fsync (or at least that's what it
did last time I looked at the source).  Not sure how the behavior of
fdatasync compares to sync_file_range.

Either way - forcing the data to be synced to disk a couple times
every second is a hack and causes fragmentation in filesystems without
delayed allocation.  Fragmentation really goes up if you are recording
multiple shows at once.

-Dave

* Re: Linux 2.6.29
  2009-04-02 16:29                               ` David Rees
@ 2009-04-02 16:42                                 ` Andrew Morton
  2009-04-02 16:57                                   ` Linus Torvalds
  2009-04-02 18:52                                   ` David Rees
  2009-04-02 21:42                                 ` Theodore Tso
  1 sibling, 2 replies; 664+ messages in thread
From: Andrew Morton @ 2009-04-02 16:42 UTC (permalink / raw)
  To: David Rees
  Cc: Janne Grunau, Lennart Sorensen, Linus Torvalds, Theodore Tso,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, 2 Apr 2009 09:29:59 -0700 David Rees <drees76@gmail.com> wrote:

> On Thu, Apr 2, 2009 at 4:05 AM, Janne Grunau <j@jannau.net> wrote:
> >> Current kernels just have too much IO latency
> >> with ext3 it seems.
> >
> > MythTV calls fsync every few seconds on ongoing recordings to prevent
> > stalls due to large cache writebacks on ext3.
> 
> Personally that is also one of my MythTV pet peeves.  A hack added to
> MythTV to work around a crappy ext3 latency bug that also causes these
> large files to get heavily fragmented.  That and the fact that you have
> to patch MythTV to eliminate those forced fdatasyncs - there is no
> knob to turn it off if you're running MythTV on a filesystem which
> doesn't suffer from ext3's data=ordered fsync stalls.
> 

For any filesystem it is quite sensible for an application to manage
the amount of dirty memory which the kernel is holding on its behalf,
and based upon the application's knowledge of its future access
patterns.

But MythTV did it the wrong way.

A suitable design for the streaming might be, every 4MB:

- run sync_file_range(SYNC_FILE_RANGE_WRITE) to get the 4MB underway
  to the disk

- run fadvise(POSIX_FADV_DONTNEED) against the previous 4MB to
  discard it from pagecache.
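
A minimal sketch of that per-chunk pattern, assuming a Linux system with glibc. The helper name `stream_chunk()` and the `CHUNK` constant are invented for illustration; only `pwrite()`, `sync_file_range()` and `posix_fadvise()` are real calls. Note that POSIX_FADV_DONTNEED can only discard pages that are already clean, so the previous chunk is dropped on a best-effort basis.

```c
#define _GNU_SOURCE	/* for sync_file_range() */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK ((size_t)4 << 20)	/* 4MB, as in the text above */

/*
 * Write one CHUNK-sized buffer at `offset`, start its writeout without
 * waiting, and ask the kernel to drop the previous chunk from pagecache.
 * Returns 0 on success, -1 on error.
 */
static int stream_chunk(int fd, const char *buf, off_t offset)
{
	size_t done = 0;

	while (done < CHUNK) {
		ssize_t n = pwrite(fd, buf + done, CHUNK - done,
				   offset + (off_t)done);
		if (n <= 0)
			return -1;
		done += (size_t)n;
	}

	/* Get this chunk underway to the disk; does not wait for completion. */
	if (sync_file_range(fd, offset, (off_t)CHUNK, SYNC_FILE_RANGE_WRITE) < 0)
		return -1;

	/* Best effort: discard the previous chunk from pagecache. */
	if (offset >= (off_t)CHUNK)
		posix_fadvise(fd, offset - (off_t)CHUNK, (off_t)CHUNK,
			      POSIX_FADV_DONTNEED);
	return 0;
}
```

A recorder would call this once per 4MB of captured data with a monotonically increasing offset.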


* Re: Linux 2.6.29
  2009-04-02 16:33                                 ` David Rees
@ 2009-04-02 16:46                                   ` Linus Torvalds
  2009-04-02 16:51                                   ` Andrew Morton
  2009-04-02 21:56                                   ` Jeff Garzik
  2 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-02 16:46 UTC (permalink / raw)
  To: David Rees
  Cc: Andrew Morton, Janne Grunau, Lennart Sorensen, Theodore Tso,
	Jesper Krogh, Linux Kernel Mailing List



On Thu, 2 Apr 2009, David Rees wrote:
> 
> MythTV actually uses fdatasync, not fsync (or at least that's what it
> did last time I looked at the source).  Not sure how the behavior of
> fdatasync compares to sync_file_range.

fdatasync() _waits_ for the data to hit the disk.

sync_file_range() just starts writeout.

It _can_ do more - you can also ask for it to wait for previous write-out 
in order to start _new_ writeout, or wait for the result, but you wouldn't 
want to, not for something like this.

sync_file_range() is really a much nicer interface, and is a more extended 
fdatasync() that actually matches what the kernel does internally.

You can think of fdatasync(fd) as a 

	sync_file_range(fd, 0, ~0ull,
		SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

and then you see why fdatasync is such a horrible interface.
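
To make the difference concrete, here is a small sketch of the two modes as wrapper functions. The names `start_writeout()` and `write_and_wait()` are made up for this illustration; the flags are the real ones. A length of 0 means "through the end of the file". Neither call writes out file metadata, so this is about latency control, not a durability guarantee.

```c
#define _GNU_SOURCE	/* for sync_file_range() */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Start writeout of the dirty pages in the range and return immediately. */
static int start_writeout(int fd, off_t off, off_t len)
{
	return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/*
 * Wait for writeout already in flight, start writeout of whatever is
 * still dirty, then wait for that too. This is the blocking behaviour
 * compared to fdatasync() above.
 */
static int write_and_wait(int fd, off_t off, off_t len)
{
	return sync_file_range(fd, off, len,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER);
}
```

A streaming writer would normally call only start_writeout() on the range it just wrote.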

		Linus

* Re: Linux 2.6.29
  2009-04-02 16:33                                 ` David Rees
  2009-04-02 16:46                                   ` Linus Torvalds
@ 2009-04-02 16:51                                   ` Andrew Morton
  2009-04-02 22:13                                     ` Jeff Garzik
  2009-04-02 21:56                                   ` Jeff Garzik
  2 siblings, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-04-02 16:51 UTC (permalink / raw)
  To: David Rees
  Cc: Janne Grunau, Lennart Sorensen, Linus Torvalds, Theodore Tso,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, 2 Apr 2009 09:33:44 -0700 David Rees <drees76@gmail.com> wrote:

> On Thu, Apr 2, 2009 at 9:09 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Thu, 2 Apr 2009 13:05:32 +0200 Janne Grunau <j@jannau.net> wrote:
> >> MythTV calls fsync every few seconds on ongoing recordings to prevent
> >> stalls due to large cache writebacks on ext3.
> >
> > It should use sync_file_range(SYNC_FILE_RANGE_WRITE).  That will
> >
> > - have minimum latency.  It tries to avoid blocking at all.
> > - avoid writing metadata
> > - avoid syncing other unrelated files within ext3
> > - avoid waiting for the ext3 commit to complete.
> 
> MythTV actually uses fdatasync, not fsync (or at least that's what it
> did last time I looked at the source).  Not sure how the behavior of
> fdatasync compares to sync_file_range.

fdatasync() will still trigger the bad ext3 behaviour.

> Either way - forcing the data to be synced to disk a couple times
> every second is a hack and causes fragmentation in filesystems without
> delayed allocation.  Fragmentation really goes up if you are recording
> multiple shows at once.

The file layout issue is unrelated to the frequency of fdatasync() -
the block allocation is done at the time of write().

ext3 _should_ handle this case fairly well nowadays - I thought we fixed that.
However it would probably benefit from having the size of the block reservation
window increased - use ioctl(EXT3_IOC_SETRSVSZ).  That way, each file gets a
decent-sized hunk of disk "reserved" for its ongoing appending.  Other
files won't come in and intermingle their blocks with it.

* Re: Linux 2.6.29
  2009-04-02 16:42                                 ` Andrew Morton
@ 2009-04-02 16:57                                   ` Linus Torvalds
  2009-04-02 17:04                                     ` Linus Torvalds
  2009-04-02 18:52                                   ` David Rees
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-04-02 16:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rees, Janne Grunau, Lennart Sorensen, Theodore Tso,
	Jesper Krogh, Linux Kernel Mailing List



On Thu, 2 Apr 2009, Andrew Morton wrote:
> 
> A suitable design for the streaming might be, every 4MB:
> 
> - run sync_file_range(SYNC_FILE_RANGE_WRITE) to get the 4MB underway
>   to the disk
> 
> - run fadvise(POSIX_FADV_DONTNEED) against the previous 4MB to
>   discard it from pagecache.

Here's an example. I call it "overwrite.c" for obvious reasons.

Except I used 8MB ranges, and I "stream" random data. Very useful for 
"secure delete" of harddisks. It gives pretty optimal speed, while not 
destroying your system experience.

Of course, I do think the kernel could/should do this kind of thing 
automatically. We really could do something like this with a "dirty LRU" 
queue. Make the logic be:

 - if you have more than "2*limit" pages in your dirty LRU queue, start 
   writeout on "limit" pages (default value: 8MB, tunable in /proc). 
   Remove from LRU queues.

 - On writeback IO completion, if it's not on any LRU list, insert page 
   into "done_write" LRU list.

 - if you have more than "2*limit" pages on the done_write LRU queue, 
   try to just get rid of the first "limit" pages.
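
The three rules above can be modelled with plain counters. This is only a user-space toy to show the thresholds, not kernel code: `struct lru_sim`, `sim_dirty_page()` and `sim_writeback_done()` are hypothetical names, and real LRU lists would track individual pages rather than counts.

```c
#include <stddef.h>

struct lru_sim {
	size_t limit;		/* batch size in pages (the "limit" above) */
	size_t dirty;		/* pages on the dirty LRU queue */
	size_t in_flight;	/* pages under writeback, on no LRU list */
	size_t done_write;	/* pages on the done_write LRU list */
	size_t dropped;		/* pages discarded from the cache */
};

/* A page was dirtied: queue it, and start writeout once past 2*limit. */
static void sim_dirty_page(struct lru_sim *s)
{
	s->dirty++;
	if (s->dirty > 2 * s->limit) {
		s->dirty -= s->limit;		/* removed from the dirty queue */
		s->in_flight += s->limit;	/* writeout started */
	}
}

/* Writeback of `pages` pages completed: move them to done_write. */
static void sim_writeback_done(struct lru_sim *s, size_t pages)
{
	if (pages > s->in_flight)
		pages = s->in_flight;
	s->in_flight -= pages;
	s->done_write += pages;
	if (s->done_write > 2 * s->limit) {
		s->done_write -= s->limit;	/* get rid of the oldest batch */
		s->dropped += s->limit;
	}
}
```

In this model a workload that never has more than 2*limit dirty pages queued never triggers writeout at all.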

It would probably work fine in general. Temp-files (smaller than 8MB 
total) would go into the dirty LRU queue, but wouldn't be written out to 
disk if they get deleted before you've generated 8MB of dirty data.

But this does the queue-handling by hand, and gives you a throughput 
indicator. It should get fairly close to disk speeds.

		Linus

---
#define _GNU_SOURCE	/* needed for the sync_file_range() declaration in <fcntl.h> */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <linux/fs.h>

#define BUFSIZE (8*1024*1024ul)

int main(int argc, char **argv)
{
	static char buffer[BUFSIZE];
	struct timeval start, now;
	unsigned int index;
	int fd;

	mlockall(MCL_CURRENT | MCL_FUTURE);
	fd = open("/dev/urandom", O_RDONLY);
	if (read(fd, buffer, BUFSIZE) != BUFSIZE) {
		perror("/dev/urandom");
		exit(1);
	}
	close(fd);
	fd = open(argv[1], O_RDWR | O_CREAT, 0666);
	if (fd < 0) {
		perror(argv[1]);
		exit(1);
	}
	gettimeofday(&start, NULL);
	for (index = 0; ;index++) {
		double s;
		unsigned long MBps;
		unsigned long MB;

		if (write(fd, buffer, BUFSIZE) != BUFSIZE)
			break;
		sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);
		if (index)
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
		gettimeofday(&now, NULL);
		s = (now.tv_sec - start.tv_sec) + ((double) now.tv_usec - start.tv_usec)/ 1000000;

		MB = index * (BUFSIZE >> 20);
		MBps = MB;
		if (s > 1)
			MBps = MBps / s;
		printf("%8lu.%03lu GB written in %5.2f (%lu MB/s)           \r",
			MB >> 10, (MB & 1023) * 1000 >> 10, s, MBps);
		fflush(stdout);
	}
	close(fd);
	printf("\n");
	return 0;
}

* Re: Linux 2.6.29
  2009-04-02 16:57                                   ` Linus Torvalds
@ 2009-04-02 17:04                                     ` Linus Torvalds
  2009-04-02 22:09                                       ` Jeff Garzik
  2009-04-03 15:14                                       ` Mark Lord
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-02 17:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rees, Janne Grunau, Lennart Sorensen, Theodore Tso,
	Jesper Krogh, Linux Kernel Mailing List



On Thu, 2 Apr 2009, Linus Torvalds wrote:
> 
> On Thu, 2 Apr 2009, Andrew Morton wrote:
> > 
> > A suitable design for the streaming might be, every 4MB:
> > 
> > - run sync_file_range(SYNC_FILE_RANGE_WRITE) to get the 4MB underway
> >   to the disk
> > 
> > - run fadvise(POSIX_FADV_DONTNEED) against the previous 4MB to
> >   discard it from pagecache.
> 
> Here's an example. I call it "overwrite.c" for obvious reasons.

Oh, except my example doesn't do the fadvise. Instead, I make sure to 
throttle the writes and the old range with

	SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER

which makes sure that the old pages are easily dropped by the VM - and 
they will be, since they end up always being on the cold list.

I _wanted_ to add a SYNC_FILE_RANGE_DROP but I never bothered because for 
this particular load it didn't matter. The system was perfectly usable while 
overwriting even huge disks because there was never more than 8MB of dirty 
data in flight in the IO queues at any time.

		Linus


* Re: Linux 2.6.29
  2009-04-02 16:42                                 ` Andrew Morton
  2009-04-02 16:57                                   ` Linus Torvalds
@ 2009-04-02 18:52                                   ` David Rees
  1 sibling, 0 replies; 664+ messages in thread
From: David Rees @ 2009-04-02 18:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Janne Grunau, Lennart Sorensen, Linus Torvalds, Theodore Tso,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 9:42 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> For any filesystem it is quite sensible for an application to manage
> the amount of dirty memory which the kernel is holding on its behalf,
> and based upon the application's knowledge of its future access
> patterns.
>
> But MythTV did it the wrong way.
>
> A suitable design for the streaming might be, every 4MB:
>
> - run sync_file_range(SYNC_FILE_RANGE_WRITE) to get the 4MB underway
>  to the disk
>
> - run fadvise(POSIX_FADV_DONTNEED) against the previous 4MB to
>  discard it from pagecache.

Yep, you're right.  sync_file_range is perfect for what MythTV wants to do.

Though there are cases where MythTV can read data it wrote out not too
long ago, for example, when commercial flagging, so
fadvise(POSIX_FADV_DONTNEED) may not be warranted.

-Dave


* Re: Linux 2.6.29
  2009-03-30 22:00         ` Hans-Peter Jansen
  2009-03-30 22:07           ` Arjan van de Ven
@ 2009-04-02 19:01           ` Andreas T.Auer
  1 sibling, 0 replies; 664+ messages in thread
From: Andreas T.Auer @ 2009-04-02 19:01 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: Linus Torvalds, Mike Galbraith, Geert Uytterhoeven, linux-kernel, arjan



On 31.03.2009 00:00 Hans-Peter Jansen wrote:

> I build kernel rpms from your git tree, and have a bunch of BUILDs lying 
> around. 
So you have a place where you have a git repository from which you copy
the source tree to rpms, which have no connection to the git anymore?

> Sure, I can always fetch the tarballs or fiddle with git, but why? 

You may add a small script like this into .git/hooks/post-checkout:

-----
#!/bin/bash

if [ "$3" == 1 ]; then # don't do it for file checkouts
 sed -ri "s/^(EXTRAVERSION =.*)/\1$(scripts/setlocalversion)/" Makefile
fi
-----

That will automatically extend EXTRAVERSION with what
CONFIG_LOCALVERSION_AUTO=y would append to the version string.


> Having a Makefile start commit allows to make sure with simplest tools, 
> say "head Makefile" that a locally copied 2.6.29 tree is really a 2.6.29, 
> and not something moving towards the next release. That's all, nothing 
> less, nothing more, it's just a strong hint which blend is in the box.

If you are working on a tagged version, EXTRAVERSION won't be
extended; on an untagged version it will carry an identifier for that
intermediate version,

e.g.
git checkout master       -> EXTRAVERSION =-07100-g833bb30
git checkout HEAD~1       -> EXTRAVERSION =-07099-g8b53ef3
git checkout v2.6.29      -> EXTRAVERSION =
git checkout HEAD~1       -> EXTRAVERSION = -rc8-00303-g0030864
git checkout v2.6.29-rc8  -> EXTRAVERSION = -rc8

In that way your copies of the source tree will have the EXTRAVERSION
set in the Makefile. You can detect an intermediate version easily in
the Makefile and you can even check out that exact version from the git
tree later, if you need to. Or just make a diff between two rpms by
diffing the versions taken from the Makefiles, e.g.

git diff 07099-g8b53ef3..07100-g833bb30
or
git diff 00303-g0030864..v2.6.29

Attention:
Of course, the Makefile is changed in your working tree as if you had
changed it yourself. Therefore you have to use "git checkout Makefile"
to revert the changes before you can check out a different version from
the git tree.

This is only a hack and there might be a better way to do it, but maybe
it helps as a starting point in your special situation.


Andreas



* Re: Linux 2.6.29
  2009-04-02 16:29                               ` David Rees
  2009-04-02 16:42                                 ` Andrew Morton
@ 2009-04-02 21:42                                 ` Theodore Tso
  1 sibling, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-04-02 21:42 UTC (permalink / raw)
  To: David Rees
  Cc: Janne Grunau, Lennart Sorensen, Andrew Morton, Linus Torvalds,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Apr 02, 2009 at 09:29:59AM -0700, David Rees wrote:
> On Thu, Apr 2, 2009 at 4:05 AM, Janne Grunau <j@jannau.net> wrote:
> >> Current kernels just have too much IO latency
> >> with ext3 it seems.
> >
> > MythTV calls fsync every few seconds on ongoing recordings to prevent
> > stalls due to large cache writebacks on ext3.
> 
> Personally that is also one of my MythTV pet peeves.  A hack added to
> MythTV to work around a crappy ext3 latency bug that also causes these
> large files to get heavily fragmented.  That and the fact that you have
> to patch MythTV to eliminate those forced fdatasyncs - there is no
> knob to turn it off if you're running MythTV on a filesystem which
> doesn't suffer from ext3's data=ordered fsync stalls.

So use XFS or ext4, and use fallocate() to get the disk blocks
allocated ahead of time.  That completely avoids the fragmentation
problem, altogether.  If you are using ext3 on a dedicated MythTV box,
I would certainly advise mounting with data=writeback, which will also
avoid the latency bug.

					- Ted


* Re: Linux 2.6.29
  2009-04-02 11:05                             ` Janne Grunau
  2009-04-02 16:09                               ` Andrew Morton
  2009-04-02 16:29                               ` David Rees
@ 2009-04-02 21:50                               ` Lennart Sorensen
  2009-04-03 15:07                               ` Mark Lord
  3 siblings, 0 replies; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-02 21:50 UTC (permalink / raw)
  To: Janne Grunau
  Cc: Andrew Morton, Linus Torvalds, Theodore Tso, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Apr 02, 2009 at 01:05:32PM +0200, Janne Grunau wrote:
> You could convert it to xfs now. xfs is probably the file system with
> the lowest complaints-to-usage ratio within the MythTV community.
> Using distinct discs for system and recording storage helps too.

Yeah, but I am not ready to give xfs another chance yet.  The nasty bugs
back in 2.6.8ish days still hurt.  Locking up the filesystem when doing
rm -rf gcc-4.0 and having to repair it after a reboot was not fun.

> MythTV calls fsync every few seconds on ongoing recordings to prevent
> stalls due to large cache writebacks on ext3.

Yeah.  What I have been seeing since 2.6.24 or 2.6.25 or so is that it
sometimes simply doesn't start playback on a file, and after 15 seconds
or so times out, and then you ask it to try again and it works the next
time just fine.  Then at times it will stop responding to the keyboard or
remote in mythtv for up to 2 minutes, and then suddenly it will respond
to whatever you hit 2 minutes ago.  Fortunately that doesn't seem to
happen that often.  I was hoping to see if 2.6.28 helped that, but lirc
didn't seem to work on my remote with that version, so I went back to
2.6.26 again.  I haven't tried 2.6.29 on it yet since I am currently
trying to fix the Debian nvidia-driver build against the new kbuild-only,
much-too-clever linux-headers-2.6.29 package they have come up with.
I think I have got that figured out though so I should be able to upgrade
that now.

-- 
Len Sorensen


* Re: Linux 2.6.29
  2009-04-02 12:17                             ` Theodore Tso
@ 2009-04-02 21:54                               ` Lennart Sorensen
  2009-04-02 23:27                                 ` Theodore Tso
  0 siblings, 1 reply; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-02 21:54 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Linus Torvalds, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Apr 02, 2009 at 08:17:35AM -0400, Theodore Tso wrote:
> Well, ext4 will be an interim solution you can convert to first.  It
> will be best with a backup/reformat/restore pass, better if you enable
> extents (at least for new files, but then you won't be able to go back
> to ext3), but you'll get improvements even if you just mount an ext3
> filesystem as ext4.

Well I did pick up a 1TB external USB/eSATA drive for pretty much such
a task.  I wasn't sure if ext4 was ready or stable enough to play
with yet though.

-- 
Len Sorensen


* Re: Linux 2.6.29
  2009-04-02 16:33                                 ` David Rees
  2009-04-02 16:46                                   ` Linus Torvalds
  2009-04-02 16:51                                   ` Andrew Morton
@ 2009-04-02 21:56                                   ` Jeff Garzik
  2 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-02 21:56 UTC (permalink / raw)
  To: David Rees
  Cc: Andrew Morton, Janne Grunau, Lennart Sorensen, Linus Torvalds,
	Theodore Tso, Jesper Krogh, Linux Kernel Mailing List

David Rees wrote:
> Either way - forcing the data to be synced to disk a couple times
> every second is a hack and causes fragmentation in filesystems without
> delayed allocation.  Fragmentation really goes up if you are recording
> multiple shows at once.

Check out posix_fallocate(3).  Not appropriate for every situation, 
might eat additional disk bandwidth...

But if you are looking to combat fragmentation, pre-allocation (manual 
or kernel-assisted) is a relevant technique.  Plus, overwriting existing 
data blocks is a LOT cheaper than appending to a file.  fsync's more 
quickly to disk, too.

	Jeff




* Re: Linux 2.6.29
  2009-04-02 17:04                                     ` Linus Torvalds
@ 2009-04-02 22:09                                       ` Jeff Garzik
  2009-04-02 22:42                                         ` Linus Torvalds
  2009-04-03 15:14                                       ` Mark Lord
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-02 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, David Rees, Janne Grunau, Lennart Sorensen,
	Theodore Tso, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Thu, 2 Apr 2009, Linus Torvalds wrote:
>> On Thu, 2 Apr 2009, Andrew Morton wrote:
>>> A suitable design for the streaming might be, every 4MB:
>>>
>>> - run sync_file_range(SYNC_FILE_RANGE_WRITE) to get the 4MB underway
>>>   to the disk
>>>
>>> - run fadvise(POSIX_FADV_DONTNEED) against the previous 4MB to
>>>   discard it from pagecache.
>> Here's an example. I call it "overwrite.c" for obvious reasons.
> 
> Oh, except my example doesn't do the fadvise. Instead, I make sure to 
> throttle the writes and the old range with
> 
> 	SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
> 
> which makes sure that the old pages are easily dropped by the VM - and 
> they will be, since they end up always being on the cold list.

Dumb VM question, then:  I understand the logic behind the 
write-throttling part (some of my own userland code does something 
similar), but,

Does this imply adding fadvise to your overwrite.c example is (a) not 
noticeable, (b) potentially less efficient, (c) potentially more efficient?

Or IOW, does fadvise purely put pages on the cold list as your 
sync_file_range incantation does, or something different?

Thanks,

	Jeff, who is already using sync_file_range in
	some server-esque userland projects




* Re: Linux 2.6.29
  2009-04-02 16:51                                   ` Andrew Morton
@ 2009-04-02 22:13                                     ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-02 22:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rees, Janne Grunau, Lennart Sorensen, Linus Torvalds,
	Theodore Tso, Jesper Krogh, Linux Kernel Mailing List

Andrew Morton wrote:
> ext3 _should_ handle this case fairly well nowadays - I thought we fixed that.
> However it would probably benefit from having the size of the block reservation
> window increased - use ioctl(EXT3_IOC_SETRSVSZ).  That way, each file gets a
> decent-sized hunk of disk "reserved" for its ongoing appending.  Other
> files won't come in and intermingle their blocks with it.

How big of a chore would it be, to use this code to implement 
i_op->fallocate() for ext3, I wonder?

	Jeff




* Re: Linux 2.6.29
  2009-04-02 22:09                                       ` Jeff Garzik
@ 2009-04-02 22:42                                         ` Linus Torvalds
  2009-04-02 22:51                                           ` Andrew Morton
  2009-04-03  2:01                                           ` Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-02 22:42 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, David Rees, Janne Grunau, Lennart Sorensen,
	Theodore Tso, Jesper Krogh, Linux Kernel Mailing List



On Thu, 2 Apr 2009, Jeff Garzik wrote:
> 
> Dumb VM question, then:  I understand the logic behind the write-throttling
> part (some of my own userland code does something similar), but,
> 
> Does this imply adding fadvise to your overwrite.c example is (a) not
> noticeable, (b) potentially less efficient, (c) potentially more efficient?

For _that_ particular load it was more of a "it wasn't the issue". I 
wanted to get timely writeouts, because otherwise they bunch up and become 
unmanageable (with even the people who are not actually writing ending up 
waiting for the writeouts). 

Once the pages are clean, it just didn't matter. The VM did the balancing 
right enough that I stopped caring. With other access patterns (ie if the 
pages ended up on the active list) the situation might have been 
different.

> Or IOW, does fadvise purely put pages on the cold list as your 
> sync_file_range incantation does, or something different?

sync_file_range() doesn't actually put the pages on the inactive list, but 
since the program was just a streaming one, they never even left it.

But no, fadvise actually tries to invalidate the pages (ie gets 
rid of them, as opposed to moving them to the inactive list).

Another note: I literally used that program just for whole-disk testing, 
so the behavior on an actual filesystem may or may not match. But I just 
tested on ext3 on my desktop, and got

     1.734 GB written in 30.38 (58 MB/s)           

until I ^C'd it, and I didn't have any sound skipping or anything like 
that. Of course, that's with those nice Intel SSDs, so that doesn't 
really say anything.

Feel free to give it a try. It _should_ maintain good write speed while 
not disturbing the system much. But I bet if you added the "fadvise()" it 
would disturb things even _less_.

My only point is really that you _can_ do streaming writes well, but at 
the same time I do think the kernel makes it too hard to do it with 
"simple" applications. I'd love to get the same kind of high-speed 
streaming behavior by just doing a simple "dd if=/dev/zero of=bigfile"

And I really think we should be able to.

And no, we clearly are _not_ able to do that now. I just tried with "dd", 
and created a 1.7G file that way, and it was stuttering - even with my 
nice SSD setup. I'm in my MUA writing this email (obviously), and in the 
middle it just totally hung for about half a minute - because it was 
obviously doing some fsync() for temporary saving etc while the "sync" was 
going on.

With the "overwrite.c" thing, I do get short pauses when my MUA does 
something, but they are not the kind of "oops, everything hung for several 
seconds" kind. 

(Full disclosure: 'alpine' with the local mbox on one disk - I _think_ 
that what alpine does is fsync() temporary save-files, but it might also 
be checking email in the background - I have not looked at _why_ alpine 
does an fsync, but it definitely does. And 5+ second delays are very 
annoying when writing emails - much less half a minute).

			Linus


* Re: Linux 2.6.29
  2009-04-02 22:42                                         ` Linus Torvalds
@ 2009-04-02 22:51                                           ` Andrew Morton
  2009-04-02 23:00                                             ` Linus Torvalds
  2009-04-03  2:01                                           ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Andrew Morton @ 2009-04-02 22:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jeff, drees76, j, lsorense, tytso, jesper, linux-kernel

On Thu, 2 Apr 2009 15:42:51 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> My only point is really that you _can_ do streaming writes well, but at 
> the same time I do think the kernel makes it too hard to do it with 
> "simple" applications. I'd love to get the same kind of high-speed 
> streaming behavior by just doing a simple "dd if=/dev/zero of=bigfile"
> 
> And I really think we should be able to.

The thing which has always worried me about trying to do smart
drop-behind is the cost of getting it wrong - and sometimes it _will_
get it wrong.

Someone out there will have an important application which linearly
writes a 1G file and then reads it all back in again.  They will get
really upset when their runtime doubles.


* Re: Linux 2.6.29
  2009-04-02 22:51                                           ` Andrew Morton
@ 2009-04-02 23:00                                             ` Linus Torvalds
  0 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-02 23:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jeff, drees76, j, lsorense, tytso, jesper, linux-kernel



On Thu, 2 Apr 2009, Andrew Morton wrote:
> 
> The thing which has always worried me about trying to do smart
> drop-behind is the cost of getting it wrong - and sometimes it _will_
> get it wrong.
> 
> Someone out there will have an important application which linearly
> writes a 1G file and then reads it all back in again.  They will get
> really upset when their runtime doubles.

Yes. The good news is that it would be a pretty easy tunable to have a 
"how soon do we writeback and how soon would we drop". And I do suspect 
that _dropping_ should default to off (exactly because of the kind of 
situation you bring up). 

As mentioned, at least in my experience the VM is pretty good at dropping 
the right pages anyway. It's when they are dirty or locked that we end up 
stuttering (or when we do fsync). And "start background writeout earlier" 
improves that case regardless of drop-behind.

But at the same time it is also unquestionably true that the current 
behavior tends to maximize throughput performance. Delaying the writes as 
long as possible is almost always the right thing for throughput. 

In my experience, at least on desktops, latency is a lot more important 
than throughput is. And I don't think anybody wants to start the writes 
_immediately_. 

			Linus


* Re: Linux 2.6.29
  2009-04-02 21:54                               ` Lennart Sorensen
@ 2009-04-02 23:27                                 ` Theodore Tso
  2009-04-03  0:32                                   ` Lennart Sorensen
  0 siblings, 1 reply; 664+ messages in thread
From: Theodore Tso @ 2009-04-02 23:27 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Andrew Morton, Linus Torvalds, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Thu, Apr 02, 2009 at 05:54:42PM -0400, Lennart Sorensen wrote:
> On Thu, Apr 02, 2009 at 08:17:35AM -0400, Theodore Tso wrote:
> > Well, ext4 will be an interim solution you can convert to first.  It
> > will be best with a backup/reformat/restore pass, better if you enable
> > extents (at least for new files, but then you won't be able to go back
> > to ext3), but you'll get improvements even if you just mount an ext3
> > filesystem as ext4.
> 
> Well I did pick up a 1TB external USB/eSATA drive for pretty much such
> a task.  I wasn't sure if ext4 was ready or stable enough to play
> with yet though.

To play with, definitely.  For production use, I'll have to let you
make your own judgements.  I've been using it on my laptop since July.

At the moment, there's only one bug which I'm very concerned about,
being worked on here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/330824

But a number of community distros will be supporting it within the
next month or two.  So it's definitely getting there.  As we increase
the user base, we'll turn up more of the harder-to-reproduce bugs, but
hopefully we'll get them fixed quickly.

					- Ted


* Re: Linux 2.6.29
  2009-04-02 23:27                                 ` Theodore Tso
@ 2009-04-03  0:32                                   ` Lennart Sorensen
  0 siblings, 0 replies; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-03  0:32 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Linus Torvalds, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Thu, Apr 02, 2009 at 07:27:15PM -0400, Theodore Tso wrote:
> To play with, definitely.  For production use, I'll have to let you
> make your own judgements.  I've been using it on my laptop since July.

Well I made a 75GB ext4 just to store temporary virtual machine images
to play with.  I won't be upset if I lose those.

> At the moment, there's only one bug which I'm very concerned about,
> being worked on here:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/330824
> 
> But a number of community distros will be supporting it within the
> next month or two.  So it's definitely getting there.  As we increase
> the user base, we'll turn up more of the harder-to-reproduce bugs, but
> hopefully we'll get them fixed quickly.

Well pretty soon I will probably consider switching to that.  btrfs sounds
neat and all, but I will wait for the disk format to get finalized first.

-- 
Len Sorensen


* Re: Linux 2.6.29
  2009-04-02 22:42                                         ` Linus Torvalds
  2009-04-02 22:51                                           ` Andrew Morton
@ 2009-04-03  2:01                                           ` Jeff Garzik
  2009-04-03  2:16                                             ` Linus Torvalds
  2009-04-03  2:38                                             ` Trenton D. Adams
  1 sibling, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03  2:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, David Rees, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2707 bytes --]

Linus Torvalds wrote:
> Feel free to give it a try. It _should_ maintain good write speed while 
> not disturbing the system much. But I bet if you added the "fadvise()" it 
> would disturb things even _less_.
> 
> My only point is really that you _can_ do streaming writes well, but at 
> the same time I do think the kernel makes it too hard to do it with 
> "simple" applications. I'd love to get the same kind of high-speed 
> streaming behavior by just doing a simple "dd if=/dev/zero of=bigfile"
> 
> And I really think we should be able to.
> 
> And no, we clearly are _not_ able to do that now. I just tried with "dd", 
> and created a 1.7G file that way, and it was stuttering - even with my 
> nice SSD setup. I'm in my MUA writing this email (obviously), and in the 
> middle it just totally hung for about half a minute - because it was 
> obviously doing some fsync() for temporary saving etc while the "sync" was 
> going on.
> 
> With the "overwrite.c" thing, I do get short pauses when my MUA does 
> something, but they are not the kind of "oops, everything hung for several 
> seconds" kind. 

Attached is my slightly-modified version of overwrite.c, modded to bound 
the file size and to use fadvise().

On a 128GB, 3.0 Gbps no-name SATA SSD, x86-64, ext3, 2.6.29 vanilla kernel:

+ ./overwrite -b 3000 /spare/tmp/test.dat
writing 3000 buffers of size 8m
       23.429 GB written in 1019.25 (23 MB/s)

real	17m0.211s
user	0m0.028s
sys	1m5.800s


+ ./overwrite -b 3000 -f /spare/tmp/test.dat
using fadvise()
writing 3000 buffers of size 8m
       23.429 GB written in 1060.54 (22 MB/s)

real	17m41.446s
user	0m0.036s
sys	1m9.016s


The most interesting thing I found:  the SSD does 80 MB/s for the first 
~1 GB or so, then slows down dramatically.  After ~2GB, it is down to 32 
MB/s.  After ~4GB, it reaches a steady speed around 23 MB/s.


--------------------------------------------------

On a 500GB, 3.0Gbps Seagate SATA drive, x86-64, ext3, 2.6.29 vanilla kernel:

+ ./overwrite -b 3000 /garz/tmp/test.dat
writing 3000 buffers of size 8m
       23.429 GB written in 539.06 (44 MB/s)

real	9m0.348s
user	0m0.064s
sys	1m2.704s


+ ./overwrite -b 3000 -f /garz/tmp/test.dat
using fadvise()
writing 3000 buffers of size 8m
       23.429 GB written in 535.08 (44 MB/s)

real	8m55.971s
user	0m0.044s
sys	1m6.600s


There is a similar performance fall-off for the Seagate, but much less 
pronounced:
	After 1GB:	52 MB/s
	After 2GB:	44 MB/s
	After 3GB:	steady state




There appears to be a small increase in system time with "-f" (use 
fadvise), but I'm guessing time(1) does not really give a good picture 
of overall system time used, when you include background VM activity.

	Jeff




[-- Attachment #2: overwrite.c --]
[-- Type: text/plain, Size: 2187 bytes --]


#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <ctype.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <linux/fs.h>

#define BUFSIZE (8*1024*1024ul)

static unsigned int maxbuf = ~0U;
static int do_fadvise;

static void parse_opt(int argc, char **argv)
{
	int ch;

	while (1) {
		ch = getopt(argc, argv, "fb:");
		if (ch == -1)
			break;

		switch (ch) {
		case 'f':
			do_fadvise = 1;
			fprintf(stderr, "using fadvise()\n");
			break;
		case 'b':
			if (atoi(optarg) > 1)
				maxbuf = atoi(optarg);
			else
				fprintf(stderr, "invalid bufcount '%s'\n",
					optarg);
			break;
		default:
			fprintf(stderr, "invalid option 0%o (%c)\n",
				ch,
				isprint(ch) ? ch : '-');
			break;
		}
	}
}

int main(int argc, char **argv)
{
	static char buffer[BUFSIZE];
	struct timeval start, now;
	unsigned int index;
	int fd;

	parse_opt(argc, argv);

	mlockall(MCL_CURRENT | MCL_FUTURE);
	fd = open("/dev/urandom", O_RDONLY);
	if (read(fd, buffer, BUFSIZE) != BUFSIZE) {
		perror("/dev/urandom");
		exit(1);
	}
	close(fd);

	fd = open(argv[optind], O_RDWR | O_CREAT, 0666);
	if (fd < 0) {
		perror(argv[optind]);
		exit(1);
	}

	if (maxbuf != ~0U)
		fprintf(stderr, "writing %u buffers of size %lum\n",
			maxbuf, BUFSIZE / (1024 * 1024ul));

	gettimeofday(&start, NULL);
	for (index = 0; index < maxbuf; index++) {
		double s;
		unsigned long MBps;
		unsigned long MB;

		if (write(fd, buffer, BUFSIZE) != BUFSIZE)
			break;
		sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);
		if (index) {
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
			/* only drop the previous buffer once one exists; (index-1) wraps at index 0 */
			if (do_fadvise)
				posix_fadvise(fd, (index-1)*BUFSIZE, BUFSIZE,
					      POSIX_FADV_DONTNEED);
		}
		gettimeofday(&now, NULL);
		s = (now.tv_sec - start.tv_sec) + ((double) now.tv_usec - start.tv_usec)/ 1000000;

		MB = index * (BUFSIZE >> 20);
		MBps = MB;
		if (s > 1)
			MBps = MBps / s;
		printf("%8lu.%03lu GB written in %5.2f (%lu MB/s)           \r",
			MB >> 10, (MB & 1023) * 1000 >> 10, s, MBps);
		fflush(stdout);
	}
	close(fd);
	printf("\n");
	return 0;
}


* Re: Linux 2.6.29
  2009-04-03  2:01                                           ` Jeff Garzik
@ 2009-04-03  2:16                                             ` Linus Torvalds
  2009-04-03  3:05                                               ` Jeff Garzik
                                                                 ` (2 more replies)
  2009-04-03  2:38                                             ` Trenton D. Adams
  1 sibling, 3 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-03  2:16 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, David Rees, Linux Kernel Mailing List



On Thu, 2 Apr 2009, Jeff Garzik wrote:
> 
> The most interesting thing I found:  the SSD does 80 MB/s for the first ~1 GB
> or so, then slows down dramatically.  After ~2GB, it is down to 32 MB/s.
> After ~4GB, it reaches a steady speed around 23 MB/s.

Are you sure that isn't an effect of double and triple indirect blocks 
etc? The metadata updates get more complex for the deeper indirect blocks.
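
For a sense of where those thresholds fall, a rough sketch assuming the
classic ext2/ext3 block map (12 direct pointers, then single, double and
triple indirect blocks, with 32-bit block pointers) and a 4KB block
size, which is an assumption about this particular filesystem.
indirection_level() is a made-up helper for illustration only.

```c
#include <stdint.h>

/*
 * Hypothetical helper: which level of an ext2/ext3-style block map
 * covers the byte at 'offset'?
 * Returns 0 (direct), 1, 2 or 3 (single/double/triple indirect),
 * or -1 if the offset is beyond what the map can address.
 */
static int indirection_level(uint64_t offset, uint64_t block_size)
{
	uint64_t ptrs = block_size / 4;		/* 32-bit pointers per block */
	uint64_t blk = offset / block_size;	/* logical block number */
	uint64_t span = 12;			/* 12 direct blocks */
	int level;

	if (blk < span)
		return 0;
	blk -= span;

	span = ptrs;				/* blocks covered by one level */
	for (level = 1; level <= 3; level++) {
		if (blk < span)
			return level;
		blk -= span;
		span *= ptrs;
	}
	return -1;
}
```

With 4KB blocks this puts the switch to double indirect blocks at about
4MB into the file, and the switch to triple indirect blocks only beyond
about 4GB.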

Or just our page cache lookup? Maybe our radix tree thing hits something 
stupid. Although it sure shouldn't be _that_ noticeable.

> There is a similar performance fall-off for the Seagate, but much less
> pronounced:
> 	After 1GB:	52 MB/s
> 	After 2GB:	44 MB/s
> 	After 3GB:	steady state

That would seem to indicate that it's something other than the disk speed. 

> There appears to be a small increase in system time with "-f" (use fadvise),
> but I'm guessing time(1) does not really give a good picture of overall system
> time used, when you include background VM activity.

It would also be good to just compare it to something like

	time sh -c "dd + sync"

(Which in my experience tends to fluctuate much more than the steady state 
thing, so I suspect you'd need to do a few runs to make sure the numbers 
are stable).

			Linus


* Re: Linux 2.6.29
  2009-04-03  2:01                                           ` Jeff Garzik
  2009-04-03  2:16                                             ` Linus Torvalds
@ 2009-04-03  2:38                                             ` Trenton D. Adams
  2009-04-03  2:54                                               ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Trenton D. Adams @ 2009-04-03  2:38 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Andrew Morton, David Rees, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 8:01 PM, Jeff Garzik <jeff@garzik.org> wrote:
> Linus Torvalds wrote:
> The most interesting thing I found:  the SSD does 80 MB/s for the first ~1
> GB or so, then slows down dramatically.  After ~2GB, it is down to 32 MB/s.
>  After ~4GB, it reaches a steady speed around 23 MB/s.

Isn't that the kernel IO queue, and the dd averaging of transfer
speed?  For example, once you hit the dirty ratio limit, that is when
it starts writing to disk.  So, the first bit you'll see really fast
speeds, as it goes to memory, but it averages out over time to a
slower speed.  As an example...

tdamac ~ # dd if=/dev/zero of=/tmp/bigfile bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00489853 s, 214 MB/s

tdamac ~ # dd if=/dev/zero of=/tmp/bigfile bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.242217 s, 43.3 MB/s

Those are with /proc/sys/vm/dirty_bytes set to 1M...
echo $((1024*1024*1)) > /proc/sys/vm/dirty_bytes

It's probably better to set it much higher though.
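
On the averaging point: the figure overwrite.c prints is total MB
written divided by total elapsed seconds, i.e. a cumulative average. A
small sketch of how the rate of one interval could be recovered from two
such readings; interval_rate() is a hypothetical helper, and feeding it
the approximate numbers reported above (80 MB/s at ~1GB, 32 MB/s at
~2GB) is only an illustration.

```c
/*
 * Hypothetical helper: given two cumulative readings (MB written so far
 * and the average MB/s since the start), return the throughput of the
 * interval between them. Returns 0 if the readings are inconsistent.
 */
static double interval_rate(double mb0, double avg0, double mb1, double avg1)
{
	double t0 = mb0 / avg0;		/* seconds elapsed at first reading */
	double t1 = mb1 / avg1;		/* seconds elapsed at second reading */

	if (t1 <= t0)
		return 0.0;
	return (mb1 - mb0) / (t1 - t0);
}
```

interval_rate(1024, 80, 2048, 32) comes out at about 20 MB/s for the
second gigabyte, i.e. lower than the 32 MB/s cumulative figure suggests.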


* Re: Linux 2.6.29
  2009-04-03  2:38                                             ` Trenton D. Adams
@ 2009-04-03  2:54                                               ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03  2:54 UTC (permalink / raw)
  To: Trenton D. Adams
  Cc: Linus Torvalds, Andrew Morton, David Rees, Linux Kernel Mailing List

Trenton D. Adams wrote:
> On Thu, Apr 2, 2009 at 8:01 PM, Jeff Garzik <jeff@garzik.org> wrote:
>> Linus Torvalds wrote:
>> The most interesting thing I found:  the SSD does 80 MB/s for the first ~1
>> GB or so, then slows down dramatically.  After ~2GB, it is down to 32 MB/s.
>>  After ~4GB, it reaches a steady speed around 23 MB/s.
> 
> Isn't that the kernel IO queue, and the dd averaging of transfer
> speed?  For example, once you hit the dirty ratio limit, that is when
> it starts writing to disk.  So, the first bit you'll see really fast
> speeds, as it goes to memory, but it averages out over time to a
> slower speed.  As an example...

overwrite.c is a special program that does this, in a loop:

	write(buffer-N) data to pagecache
	start buffer-N write-out to storage
	wait for buffer-(N-1) write-out to complete

It uses the sync_file_range() system call, which is like fsync() on 
steroids, wearing cool sunglasses.
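
For readers without the attachment, the loop above might look roughly like the following. This is a sketch written from the three-step description, not the actual overwrite.c; the function name streaming_overwrite() and its parameters are assumptions. It is Linux-specific because of sync_file_range().

```c
#define _GNU_SOURCE             /* for sync_file_range() */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch only (not the overwrite.c from this thread): write `nbufs` buffers
 * of `bufsize` bytes to `path`, starting write-out of each buffer as soon as
 * it is written and waiting for the previous one to finish.
 * Returns the number of bytes written, or -1 on error. */
long long streaming_overwrite(const char *path, size_t bufsize, unsigned nbufs)
{
    assert(path != NULL && bufsize > 0);

    char *buf = malloc(bufsize);
    if (!buf)
        return -1;
    memset(buf, 'x', bufsize);

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    long long total = -1;
    for (unsigned n = 0; n < nbufs; n++) {
        off_t off = (off_t)n * (off_t)bufsize;

        /* 1. write buffer N into the page cache */
        size_t done = 0;
        while (done < bufsize) {
            ssize_t r = pwrite(fd, buf + done, bufsize - done,
                               off + (off_t)done);
            if (r < 0)
                goto out;
            done += (size_t)r;
        }

        /* 2. start write-out of buffer N without waiting for it */
        if (sync_file_range(fd, off, (off_t)bufsize,
                            SYNC_FILE_RANGE_WRITE) < 0)
            goto out;

        /* 3. wait for write-out of buffer N-1 to complete */
        if (n > 0 &&
            sync_file_range(fd, off - (off_t)bufsize, (off_t)bufsize,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) < 0)
            goto out;
    }
    /* Note: the final buffer has been queued but not waited for. */
    total = (long long)nbufs * (long long)bufsize;
out:
    close(fd);
    free(buf);
    return total;
}
```

Calling streaming_overwrite("bigfile", 8 << 20, 2800) would correspond to the "2800 buffers of size 8m" runs, assuming the same sizes are wanted.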

Regards,

	Jeff






* Re: Linux 2.6.29
  2009-04-03  2:16                                             ` Linus Torvalds
@ 2009-04-03  3:05                                               ` Jeff Garzik
  2009-04-03  3:34                                                 ` Linus Torvalds
  2009-04-03  5:05                                               ` Nick Piggin
  2009-04-03  8:31                                               ` Jeff Garzik
  2 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03  3:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, David Rees, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Thu, 2 Apr 2009, Jeff Garzik wrote:
>> The most interesting thing I found:  the SSD does 80 MB/s for the first ~1 GB
>> or so, then slows down dramatically.  After ~2GB, it is down to 32 MB/s.
>> After ~4GB, it reaches a steady speed around 23 MB/s.
> 
> Are you sure that isn't an effect of double and triple indirect blocks 
> etc? The metadata updates get more complex for the deeper indirect blocks.

> Or just our page cache lookup? Maybe our radix tree thing hits something 
> stupid. Although it sure shouldn't be _that_ noticeable.

Indirect block overhead increased as the file grew to 23 GB, I'm sure...

I should probably re-test pre-creating the file, _then_ running 
overwrite.c.  That would at least guarantee the filesystem isn't 
allocating new blocks and metadata.

I was really surprised the performance was so high at first, then fell 
off so dramatically, on the SSD here.

Unfortunately I cannot trash these blkdevs, so the raw blkdev numbers 
are not immediately measurable.


>> There is a similar performance fall-off for the Seagate, but much less
>> pronounced:
>> 	After 1GB:	52 MB/s
>> 	After 2GB:	44 MB/s
>> 	After 3GB:	steady state
> 
> That would seem to indicate that it's something else than the disk speed. 
> 
>> There appears to be a small increase in system time with "-f" (use fadvise),
>> but I'm guessing time(1) does not really give a good picture of overall system
>> time used, when you include background VM activity.
> 
> It would also be good to just compare it to something like
> 
> 	time sh -c "dd + sync"

I'll add that to the next run...

	Jeff




* Re: Linux 2.6.29
  2009-04-03  3:05                                               ` Jeff Garzik
@ 2009-04-03  3:34                                                 ` Linus Torvalds
  2009-04-03 11:32                                                   ` Chris Mason
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-04-03  3:34 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, David Rees, Linux Kernel Mailing List



On Thu, 2 Apr 2009, Jeff Garzik wrote:
> 
> I was really surprised the performance was so high at first, then fell off so
> dramatically, on the SSD here.

Well, one rather simple explanation is that if you hadn't been doing lots 
of writes, then the background garbage collection on the Intel SSD gets 
ahead of the game, and gives you lots of bursty nice write bandwidth due 
to having nicely compacted and pre-erased blocks.

Then, after lots of writing, all the pre-erased blocks are gone, and you 
are down to a steady state where it needs to GC and erase blocks to make 
room for new writes.

So that part doesn't surprise me per se. The Intel SSDs definitely 
fluctuate a bit timing-wise (but I love how they never degenerate to the 
"ooh, that _really_ sucks" case that the other SSDs and the rotational 
media I've seen do when you do random writes).

The fact that it also happens for the regular disk does imply that it's 
not the _only_ thing going on, though.

> Unfortunately I cannot trash these blkdevs, so the raw blkdev numbers are not
> immediately measurable.

Hey, understood. I don't think raw block accesses are even all that 
interesting. But you might try to write the file backwards, and see if you 
see the same pattern.
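
A backwards write could be sketched like this, assuming the file has already been created at full size (on a fresh file the first write lands at the end and leaves a sparse file behind it). The function name write_backwards() is invented for the example:

```c
#define _XOPEN_SOURCE 700       /* for pwrite() */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch: write `nbufs` buffers of `bufsize` bytes to `path`, starting with
 * the last buffer and ending with the first.
 * Returns the number of bytes written, or -1 on error. */
long long write_backwards(const char *path, size_t bufsize, unsigned nbufs)
{
    assert(path != NULL && bufsize > 0);

    char *buf = malloc(bufsize);
    if (!buf)
        return -1;
    memset(buf, 'x', bufsize);

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    long long total = -1;
    /* n counts nbufs-1, nbufs-2, ..., 0 */
    for (unsigned n = nbufs; n-- > 0; ) {
        off_t off = (off_t)n * (off_t)bufsize;
        size_t done = 0;

        while (done < bufsize) {            /* cope with short writes */
            ssize_t r = pwrite(fd, buf + done, bufsize - done,
                               off + (off_t)done);
            if (r < 0)
                goto out;
            done += (size_t)r;
        }
    }
    total = (long long)nbufs * (long long)bufsize;
out:
    close(fd);
    free(buf);
    return total;
}
```

If the slowdown follows the amount written regardless of direction, the file offset (and with it the depth of the indirect blocks) is a less likely cause.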

			Linus


* Re: Linux 2.6.29
  2009-04-02  1:00                               ` Ingo Molnar
@ 2009-04-03  4:06                                 ` Lennart Sorensen
  2009-04-03  4:13                                   ` Linus Torvalds
  2009-04-03 22:28                                   ` Jeff Moyer
  0 siblings, 2 replies; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-03  4:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, torvalds, tytso, drees76, jesper, linux-kernel

On Thu, Apr 02, 2009 at 03:00:44AM +0200, Ingo Molnar wrote:
> I'll test this (and the other suggestions) once i'm out of the merge 
> window.
> 
> I probably wont test that though ;-)
> 
> Going back to v2.6.14 to do pre-mutex-merge performance tests was 
> already quite a challenge on modern hardware.

Well after a day of running my mythtv box with anticipatory rather than
the default cfq scheduler, it certainly looks a lot better.  I haven't
seen any slowdowns, the disk activity light isn't on solidly (it just
flashes every couple of seconds instead), and it doesn't even mind
me launching bittornado on multiple torrents at the same time as two
recordings are taking place and some commercial flagging is taking place.
With cfq this would usually make the system unusable (and a Q6600 with
6GB ram should never be unresponsive in my opinion).

So so far I would rank anticipatory at about 1000x better than cfq for
my work load.  It sure acts a lot more like it used to back in 2.6.18
times.
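
For anyone wanting to repeat the comparison: the active scheduler is the bracketed name in /sys/block/<dev>/queue/scheduler, and writing another name to that file as root switches it. A small sketch for reading it from C follows; the function names and the device name passed in are assumptions for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the active scheduler (the name inside [brackets]) from a line such as
 * "noop anticipatory deadline [cfq]" into `out`. Returns 0 on success,
 * -1 if no bracketed name is found or it does not fit in `out`. */
int active_scheduler(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    if (!lb || !rb)
        return -1;

    size_t len = (size_t)(rb - lb - 1);
    if (len + 1 > outlen)
        return -1;

    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the scheduler line for a device name like "sda" (hypothetical) and
 * extract the active entry. Returns 0 on success, -1 on any failure. */
int read_active_scheduler(const char *dev, char *out, size_t outlen)
{
    char path[128], line[256];

    snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char *ok = fgets(line, sizeof(line), f);
    fclose(f);
    return ok ? active_scheduler(line, out, outlen) : -1;
}
```

Recording the result next to each benchmark run makes it harder to mix up which numbers came from which scheduler.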

-- 
Len Sorensen


* Re: Linux 2.6.29
  2009-04-03  4:06                                 ` Lennart Sorensen
@ 2009-04-03  4:13                                   ` Linus Torvalds
  2009-04-03  7:25                                     ` Jens Axboe
  2009-04-03 22:28                                   ` Jeff Moyer
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-04-03  4:13 UTC (permalink / raw)
  To: Lennart Sorensen, Jens Axboe
  Cc: Ingo Molnar, Andrew Morton, tytso, drees76, jesper,
	Linux Kernel Mailing List


Jens - remind us what the problem with AS was wrt CFQ?

There's some write throttling in CFQ, maybe it has some really broken 
case?

		Linus

On Fri, 3 Apr 2009, Lennart Sorensen wrote:

> On Thu, Apr 02, 2009 at 03:00:44AM +0200, Ingo Molnar wrote:
> > I'll test this (and the other suggestions) once i'm out of the merge 
> > window.
> > 
> > I probably wont test that though ;-)
> > 
> > Going back to v2.6.14 to do pre-mutex-merge performance tests was 
> > already quite a challenge on modern hardware.
> 
> Well after a day of running my mythtv box with anticipatory rather than
> the default cfq scheduler, it certainly looks a lot better.  I haven't
> seen any slowdowns, the disk activity light isn't on solidly (it just
> flashes every couple of seconds instead), and it doesn't even mind
> me launching bittornado on multiple torrents at the same time as two
> recordings are taking place and some commercial flagging is taking place.
> With cfq this would usually make the system unusable (and a Q6600 with
> 6GB ram should never be unresponsive in my opinion).
> 
> So so far I would rank anticipatory at about 1000x better than cfq for
> my work load.  It sure acts a lot more like it used to back in 2.6.18
> times.
> 
> -- 
> Len Sorensen
> 


* Re: Linux 2.6.29
  2009-04-03  2:16                                             ` Linus Torvalds
  2009-04-03  3:05                                               ` Jeff Garzik
@ 2009-04-03  5:05                                               ` Nick Piggin
  2009-04-03  8:31                                               ` Jeff Garzik
  2 siblings, 0 replies; 664+ messages in thread
From: Nick Piggin @ 2009-04-03  5:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Andrew Morton, David Rees, Linux Kernel Mailing List

On Friday 03 April 2009 13:16:08 Linus Torvalds wrote:
> 
> On Thu, 2 Apr 2009, Jeff Garzik wrote:
> > 
> > The most interesting thing I found:  the SSD does 80 MB/s for the first ~1 GB
> > or so, then slows down dramatically.  After ~2GB, it is down to 32 MB/s.
> > After ~4GB, it reaches a steady speed around 23 MB/s.
> 
> Are you sure that isn't an effect of double and triple indirect blocks 
> etc? The metadata updates get more complex for the deeper indirect blocks.
> 
> Or just our page cache lookup? Maybe our radix tree thing hits something 
> stupid. Although it sure shouldn't be _that_ noticeable.

Hmm, I don't know what you have in mind. page cache lookup should be
several orders of magnitude faster than a disk can write the pages out?

Dirty/writeout/clean cycle still has to lock the radix tree to change
tags, but that's really not going to be significantly contended (nor
does it synchronise with simple lookups).


* Re: Linux 2.6.29
  2009-04-03  4:13                                   ` Linus Torvalds
@ 2009-04-03  7:25                                     ` Jens Axboe
  2009-04-03  8:15                                       ` Ingo Molnar
  2009-04-03 14:21                                       ` Lennart Sorensen
  0 siblings, 2 replies; 664+ messages in thread
From: Jens Axboe @ 2009-04-03  7:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lennart Sorensen, Ingo Molnar, Andrew Morton, tytso, drees76,
	jesper, Linux Kernel Mailing List

On Thu, Apr 02 2009, Linus Torvalds wrote:
> 
> Jens - remind us what the problem with AS was wrt CFQ?

CFQ was just faster, plus it supported things like io priorities that AS
does not.

> There's some write throttling in CFQ, maybe it has some really broken 
> case?

Who knows, it's definitely interesting and something to look into why AS
performs that differently to CFQ on his box. Lennart, can you give some
information on what file system + mount options, disk drive(s), etc? A
full dmesg would be good, too.

> 
> 		Linus
> 
> On Fri, 3 Apr 2009, Lennart Sorensen wrote:
> 
> > On Thu, Apr 02, 2009 at 03:00:44AM +0200, Ingo Molnar wrote:
> > > I'll test this (and the other suggestions) once i'm out of the merge 
> > > window.
> > > 
> > > I probably wont test that though ;-)
> > > 
> > > Going back to v2.6.14 to do pre-mutex-merge performance tests was 
> > > already quite a challenge on modern hardware.
> > 
> > Well after a day of running my mythtv box with anticipatory rather than
> > the default cfq scheduler, it certainly looks a lot better.  I haven't
> > seen any slowdowns, the disk activity light isn't on solidly (it just
> > flashes every couple of seconds instead), and it doesn't even mind
> > me launching bittornado on multiple torrents at the same time as two
> > recordings are taking place and some commercial flagging is taking place.
> > With cfq this would usually make the system unusable (and a Q6600 with
> > 6GB ram should never be unresponsive in my opinion).
> > 
> > So so far I would rank anticipatory at about 1000x better than cfq for
> > my work load.  It sure acts a lot more like it used to back in 2.6.18
> > times.
> > 
> > -- 
> > Len Sorensen
> > 

-- 
Jens Axboe



* Re: Linux 2.6.29
  2009-04-03  7:25                                     ` Jens Axboe
@ 2009-04-03  8:15                                       ` Ingo Molnar
  2009-04-06 21:46                                         ` Bill Davidsen
  2009-04-03 14:21                                       ` Lennart Sorensen
  1 sibling, 1 reply; 664+ messages in thread
From: Ingo Molnar @ 2009-04-03  8:15 UTC (permalink / raw)
  To: Jens Axboe, Nick Piggin
  Cc: Linus Torvalds, Lennart Sorensen, Andrew Morton, tytso, drees76,
	jesper, Linux Kernel Mailing List, Peter Zijlstra


* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Thu, Apr 02 2009, Linus Torvalds wrote:

> > On Fri, 3 Apr 2009, Lennart Sorensen wrote:

> > > So so far I would rank anticipatory at about 1000x better than 
> > > cfq for my work load.  It sure acts a lot more like it used to 
> > > back in 2.6.18 times.
[...]

> > Jens - remind us what the problem with AS was wrt CFQ?
> 
> CFQ was just faster, plus it supported things like io priorities 
> that AS does not.

btw., while pluggable IO schedulers have their upsides:

 - They are easier to test during development and deployment.

 - The uptake of a new, experimental IO scheduler is faster due to 
   easier availability.

 - Regressions in the primary IO scheduler are easier to prove.

And the technical case for pluggable IO schedulers is much stronger 
than the case for pluggable process schedulers:

 - Persistent media has persistent workloads - and each workload has
   different access patterns.

 - The inefficiencies of mixed workloads on the same rotating media
   have forced a clear separation of the 'one disk, one workload'
   usage model, and have hammered this into people's minds. (Nobody 
   in their right mind is going to put a big Oracle and SAP
   installation on the same [rotating] disk.)

 - the 'NOP' scheduler makes sense on media with RAM-like
   properties. 90% of CFQ's overhead is useless fluff on such media.

 - [ These properties are not there for CPU schedulers: CPUs are 
     data processors not persistent data storage so they are 
     fundamentally shared by all workloads and have a lot less
     persistent state - so mixing workloads on CPUs is common and
     having one good scheduler is paramount. ]

At the risk of restarting the "to plug or not to plug" scheduler 
flamewars ;-), the pluggable IO scheduler design has its very clear 
downsides as well:

 - 99% of users use CFQ, so any bugs in it will hit 99% of the Linux 
   community and we have not actually won much in terms of helping 
   real people out in the field.

 - We are many years down the road of having replaced AS with the
   supposedly better CFQ - and AS is still (or again?) markedly
   better for some common tests.

 - The 1% of testers/users who find that CFQ sucks and track it down
   to CFQ can easily switch back to another IO scheduler: NOP or AS. 

   This dilutes the quality of _CFQ_, our crown jewel IO scheduler: 
   as it removes critical participants from the pool of testers. 
   They might be only 1% of all Linux users, but they are the 1% who 
   make things happen upstream.

   The result: even if CFQ sucks for some important workloads, the
   combined social pressure is IMO never strong enough on upstream
   to get our act together. While we might fix the bugs reported 
   here, the time to realize and address these bugs was way too 
   long. Power-users configure their way out and go the path of least 
   resistance and the rest suffers in silence.

 - There's not even any feedback in the common case: people think
   "hey, what I'm doing must be some oddball thing" and leave it at 
   that. Even if that oddball thing is not odd at all. Furthermore, 
   getting feedback _after_ someone has solved their problems by 
   switching to AS is a lot harder than getting feedback while they 
   are still hurting and cursing. Yesterday's solved problem is 
   boring and a lot less worthy to report than today's high-prio 
   ticket.

 - It is _too easy_ to switch to AS, and shops with critical data 
   will not be as eager to report CFQ problems, and will not be as 
   eager to test experimental kernel patches that fix CFQ problems, 
   if they can switch to AS at the flip of a switch.

Ergo, i think pluggable designs for something as critical and as 
central as IO scheduling have their clear downsides, as they created two 
mediocre schedulers:

 - CFQ with all the modern features but performance problems on 
   certain workloads

 - Anticipatory with legacy features only but works (much!) better 
   on some workloads.

... instead of giving us just a single well-working CFQ scheduler.

This, IMHO, in its current form, seems to trump the upsides of IO 
schedulers.

So i do think that late during development (i.e. now), _years_ down 
the line, we should make it gradually harder for people to use AS.

I'd not remove the AS code per se (it _is_ convenient to test it 
without having to patch the kernel - especially now that we _know_ 
that there is a common problem, and there _are_ genuinely oddball 
workloads where it might work better due to luck or design), but 
still we should:

 - Make it harder to configure in.

 - Change the /sys switch-to-AS method to break any existing scripts
   that switched CFQ to AS. Add a warning to the syslog if an old 
   script uses the old method and document the change prominently but 
   do _not_ switch the IO scheduler to AS.

 - If the user still switched to AS, emit some scary warning about 
   this being an obsolete IO scheduler, that it is not being tested 
   as widely as CFQ and hence might have bugs, and that if the user
   still feels absolutely compelled to use it, to report his problem 
   to the appropriate mailing lists so that upstream can fix CFQ 
   instead.

By splintering the pool of testers and by removing testers from that 
pool who are the most important in getting our default IO scheduler 
tested we are not doing ourselves any favors.

Btw., my personal opinion is that even such extreme measures dont 
work fully right due to social factors, so _my_ preferred choice for 
doing such things is well known: to implement one good default 
scheduler and to fix all bugs in it ;-)

For IO schedulers i think there's just two sane technical choices 
for plugins: one good default scheduler (CFQ) or no IO scheduler at 
all (NOP).

The rest is development fuzz or migration fuzz - and such fuzz needs 
to be forced to zero after years of stabilization.

What do you think?

	Ingo


* Re: Linux 2.6.29
  2009-04-03  2:16                                             ` Linus Torvalds
  2009-04-03  3:05                                               ` Jeff Garzik
  2009-04-03  5:05                                               ` Nick Piggin
@ 2009-04-03  8:31                                               ` Jeff Garzik
  2009-04-03  8:35                                                 ` Jeff Garzik
  2 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03  8:31 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Linus Torvalds, Andrew Morton, David Rees

[-- Attachment #1: Type: text/plain, Size: 1901 bytes --]

Linus Torvalds wrote:
> 
> On Thu, 2 Apr 2009, Jeff Garzik wrote:
>> The most interesting thing I found:  the SSD does 80 MB/s for the first ~1 GB
>> or so, then slows down dramatically.  After ~2GB, it is down to 32 MB/s.
>> After ~4GB, it reaches a steady speed around 23 MB/s.
> 
> Are you sure that isn't an effect of double and triple indirect blocks 
> etc? The metadata updates get more complex for the deeper indirect blocks.
> 
> Or just our page cache lookup? Maybe our radix tree thing hits something 
> stupid. Although it sure shouldn't be _that_ noticeable.
> 
>> There is a similar performance fall-off for the Seagate, but much less
>> pronounced:
>> 	After 1GB:	52 MB/s
>> 	After 2GB:	44 MB/s
>> 	After 3GB:	steady state
> 
> That would seem to indicate that it's something else than the disk speed. 

Attached are some additional tests using sync_file_range, dd, an SSD and 
a normal SATA disk.  The test program -- overwrite.c -- is unchanged 
from my last posting, basically the same as Linus's except with 
posix_fadvise()

Observations:

* the no-name SSD does seem to burst the first ~1GB of writes rapidly, 
but degrades to a much lower sustained level, as observed before. 
Repeated tests do not produce ~80 MB/s, only the first test, which lends 
credence to the theory about background activity.

* For the SSD, overwrite is noticeably faster than dd.

* For the Seagate NCQ hard drive, dd is noticeably faster than overwrite.

* fadvise() appears to help, but mostly the results are either 
inconclusive or lost in the noise:  A slight increase in throughput, and 
a slight increase in system time.

The test sequence for both SATA devices was the following:

	3 x dd
	3 x overwrite
	3 x overwrite w/ fadvise(don't need)

System setup: Intel Nehalem x86-64, ICH10, Fedora 10, ext3 
filesystem (mounted defaults + noatime), 2.6.29 vanilla kernel.

Regards,

	Jeff






[-- Attachment #2: test-output.txt --]
[-- Type: text/plain, Size: 2804 bytes --]

=======================================================
128GB, 3.0 Gbps no-name SATA SSD, x86-64, ext3, 2.6.29 vanilla

First dd(1) creates the file, others simply rewrite it.
=======================================================
24000+0 records in
24000+0 records out
25165824000 bytes (25 GB) copied, 917.599 s, 27.4 MB/s)

real	15m30.928s
user	0m0.016s
sys	1m3.924s


24000+0 records in
24000+0 records out
25165824000 bytes (25 GB) copied, 1056.92 s, 23.8 MB/s)

real	18m1.686s
user	0m0.016s
sys	1m4.816s


24000+0 records in
24000+0 records out
25165824000 bytes (25 GB) copied, 1044.25 s, 24.1 MB/s)

real	17m37.884s
user	0m0.020s
sys	1m4.300s


writing 2800 buffers of size 8m
21.867 GB written in 645.56 (34 MB/s)

real	10m46.502s
user	0m0.044s
sys	0m35.990s


writing 2800 buffers of size 8m
21.867 GB written in 634.55 (35 MB/s)

real	10m35.448s
user	0m0.036s
sys	0m36.466s


writing 2800 buffers of size 8m
21.867 GB written in 642.00 (34 MB/s)

real	10m42.890s
user	0m0.044s
sys	0m34.930s


using fadvise()
writing 2800 buffers of size 8m
21.867 GB written in 639.49 (35 MB/s)

real	10m40.384s
user	0m0.036s
sys	0m38.582s


using fadvise()
writing 2800 buffers of size 8m
21.867 GB written in 636.17 (35 MB/s)

real	10m37.061s
user	0m0.024s
sys	0m39.146s


using fadvise()
writing 2800 buffers of size 8m
21.867 GB written in 636.07 (35 MB/s)

real	10m37.003s
user	0m0.060s
sys	0m39.174s


=======================================================
500GB, 3.0Gbps Seagate SATA drive, x86-64, ext3, 2.6.29 vanilla

First dd(1) creates the file, others simply rewrite it.
=======================================================
24000+0 records in
24000+0 records out
25165824000 bytes (25 GB) copied, 494.797 s, 50.9 MB/s)

real	8m42.680s
user	0m0.016s
sys	0m58.176s


24000+0 records in
24000+0 records out
25165824000 bytes (25 GB) copied, 498.295 s, 50.5 MB/s)

real	8m27.505s
user	0m0.016s
sys	0m58.744s


24000+0 records in
24000+0 records out
25165824000 bytes (25 GB) copied, 492.145 s, 51.1 MB/s)

real	8m23.616s
user	0m0.016s
sys	0m59.064s


writing 2800 buffers of size 8m
21.867 GB written in 478.41 (46 MB/s)

real	7m59.690s
user	0m0.032s
sys	0m33.210s


writing 2800 buffers of size 8m
21.867 GB written in 513.54 (43 MB/s)

real	8m34.461s
user	0m0.048s
sys	0m33.342s


writing 2800 buffers of size 8m
21.867 GB written in 471.38 (47 MB/s)

real	7m52.641s
user	0m0.020s
sys	0m33.486s


using fadvise()
writing 2800 buffers of size 8m
21.867 GB written in 467.67 (47 MB/s)

real	7m48.756s
user	0m0.048s
sys	0m36.838s


using fadvise()
writing 2800 buffers of size 8m
21.867 GB written in 462.69 (48 MB/s)

real	7m43.597s
user	0m0.020s
sys	0m37.462s


using fadvise()
writing 2800 buffers of size 8m
21.867 GB written in 463.56 (48 MB/s)

real	7m44.472s
user	0m0.036s
sys	0m37.342s



[-- Attachment #3: run-test.sh --]
[-- Type: application/x-sh, Size: 481 bytes --]


* Re: Linux 2.6.29
  2009-04-03  8:31                                               ` Jeff Garzik
@ 2009-04-03  8:35                                                 ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03  8:35 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Linus Torvalds, Andrew Morton, David Rees

Jeff Garzik wrote:
> Attached are some additional tests using sync_file_range, dd, an SSD and 
> a normal SATA disk.  The test program -- overwrite.c -- is unchanged 
> from my last posting, basically the same as Linus's except with 
> posix_fadvise()

Oh, and, as run-test.sh shows, these tests were done with the file 
pre-allocated and sync'd to disk.

The dd and overwrite invocations that follow the first dd invocation do 
/not/ require the fs to allocate new blocks.

	Jeff





* Re: Linux 2.6.29
  2009-04-03  3:34                                                 ` Linus Torvalds
@ 2009-04-03 11:32                                                   ` Chris Mason
  2009-04-03 15:07                                                     ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Chris Mason @ 2009-04-03 11:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Andrew Morton, David Rees, Linux Kernel Mailing List

On Thu, 2009-04-02 at 20:34 -0700, Linus Torvalds wrote:
> 
> On Thu, 2 Apr 2009, Jeff Garzik wrote:
> > 
> > I was really surprised the performance was so high at first, then fell off so
> > dramatically, on the SSD here.
> 
> Well, one rather simple explanation is that if you hadn't been doing lots 
> of writes, then the background garbage collection on the Intel SSD gets 
> ahead of the game, and gives you lots of bursty nice write bandwidth due 
> to having nicely compacted and pre-erased blocks.
> 
> Then, after lots of writing, all the pre-erased blocks are gone, and you 
> are down to a steady state where it needs to GC and erase blocks to make 
> room for new writes.
> 
> So that part doesn't surprise me per se. The Intel SSDs definitely 
> fluctuate a bit timing-wise (but I love how they never degenerate to the 
> "ooh, that _really_ sucks" case that the other SSDs and the rotational 
> media I've seen do when you do random writes).
> 

23MB/s seems a bit low though, I'd try with O_DIRECT.  ext3 doesn't do
writepages, and the ssd may be very sensitive to smaller writes (what
brand?)

> The fact that it also happens for the regular disk does imply that it's 
> not the _only_ thing going on, though.
> 

Jeff if you blktrace it I can make up a seekwatcher graph.  My bet is
that pdflush is stuck writing the indirect blocks, and doing a ton of
seeks.

You could change the overwrite program to also do sync_file_range on the
block device ;)

-chris




* Re: Linux 2.6.29
  2009-03-27  5:13                             ` Theodore Tso
  2009-03-27  5:57                               ` Matthew Garrett
@ 2009-04-03 12:39                               ` Pavel Machek
  1 sibling, 0 replies; 664+ messages in thread
From: Pavel Machek @ 2009-04-03 12:39 UTC (permalink / raw)
  To: Theodore Tso, Matthew Garrett, Linus Torvalds, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Hi!

> > I'm utterly and screamingly bored of this "Blame userspace" attitude. 
> 
> I'm not blaming userspace.  I'm blaming ourselves, for implementing an
> attractive nuisance, and not realizing that we had implemented an
> attractive nuisance; which years later, is also responsible for these
> latency problems, both with and without fsync() ---- *and* which have
> also trained people into believing that fsync() is always expensive,
> and must be avoided at all costs --- which had not previously been
> true!

Well... fsync is quite expensive. If your disk is down, it costs 3 seconds+
and 3J+. If your disk is up, it will only take 20msec+.

OTOH the rename trick on ext3 costs approximately nothing...

Imagine those desktops where they want windows layout
preserved. Having 30 second old layout is acceptable, losing layout
altogether is not. If you add fsync to the window manager, user will
see those 3 seconds+ delays, unless window manager gets multithreaded.
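
To make the two options concrete, here is a rough sketch of the write-temp-then-rename pattern with the fsync() step made optional. The helper name save_by_rename() and the ".tmp" suffix are assumptions for the example, not anything a particular window manager does. Which contents survive a crash in the no-fsync case depends on the filesystem and its mount options.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `data` by writing
 * "<path>.tmp" and renaming it over the target. With do_fsync == 0 this is
 * the cheap rename trick; with do_fsync != 0 the data is flushed to disk
 * before the rename. Returns 0 on success, -1 on error. */
int save_by_rename(const char *path, const char *data, size_t len, int do_fsync)
{
    char tmp[4096];

    assert(path != NULL && data != NULL);
    if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp))
        return -1;                      /* path too long for the buffer */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {                /* cope with short writes */
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0)
            goto fail;
        done += (size_t)n;
    }

    /* The expensive step: may have to spin the disk up and wait for it. */
    if (do_fsync && fsync(fd) < 0)
        goto fail;

    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) < 0) {        /* atomically replaces the old file */
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

A caller that can tolerate a 30 second old layout would pass do_fsync = 0; one that cannot would pass 1 and accept the delay, or do the save from a separate thread.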

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: Linux 2.6.29
  2009-04-03  7:25                                     ` Jens Axboe
  2009-04-03  8:15                                       ` Ingo Molnar
@ 2009-04-03 14:21                                       ` Lennart Sorensen
  2009-04-03 15:05                                         ` Mark Lord
  1 sibling, 1 reply; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-03 14:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, tytso, drees76,
	jesper, Linux Kernel Mailing List

On Fri, Apr 03, 2009 at 09:25:07AM +0200, Jens Axboe wrote:
> CFQ was just faster, plus it supported things like io priorities that AS
> does not.

Faster at what?  I am now wondering if switching the servers at work to
anticipatory will make them be more responsive when an rsnapshot run is
done (which it is every 3 hours).  That would provide another data point.
It is currently very easy to tell when 10:00, 13:00, 16:00 and 19:00
comes around.

> Who knows, it's definitely interesting and something to look into why AS
> performs that differently to CFQ on his box. Lennart, can you give some
> information on what file system + mount options, disk drive(s), etc? A
> full dmesg would be good, too.

Well the system is set up like this:

Core 2 Quad Q6600 CPU (2.4GHz quad core).
Asus P5K mainboard (Intel P35 chipset)
6GB of ram
PVR500 dual NTSC tuner pci card
4 x 500GB WD5000AAKS SATA drives
	25GB sda1 + sdb1 raid1 for /
	25GB sdc1 + sdd1 raid1 for /home
	remaining as sda2 + sdb2 + sdc2 + sdd2 raid5 for LVM.
	1.2TB /var uses most of the LVM for mythtv storage and other data
	6GB swap on LVM
	94GB test volume on LVM (this uses ext4 but is hardly ever used)
	all filesystems other than the test one are ext3

I run the ICH9 in AHCI mode since in IDE mode it doesn't do 64bit DMA
and the bounce buffers seemed to be having issues keeping up.

So normal use of the machine is:

mythtv-backend + mysql takes care of the mythtv recording work.

mythtv-frontend with output on an nvidia 8600GT (with proprietary drivers
in use)
commercial flagging and some transcode to mpeg4 for shows I keep for a
while run in parallel, since after all there are 4 cores to use.

folding@home running smp (using all 4 cores) at idle priority

bittornado running with many torrents running slowly seeding (I limit it
to 5kb/s up at all times due to monthly caps on transfers from my ISP,
so this way it can do something consistently without going over).
It probably has 300GB worth of files being seeded at the moment.

So when I first built the machine it ran really nicely, but I think that
was with 2.6.16 or 2.6.18 or so.  It was a while ago.  It worked quite
well, responsiveness was good, etc.  2.6.24 - 2.6.26 have not been so
great.  Well, until I switched the I/O scheduler a couple of days ago.
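
For reference, a minimal sketch of how the active scheduler can be
inspected and switched at runtime.  The mail does not say how the switch
was actually made; this assumes the usual sysfs file
/sys/block/<device>/queue/scheduler, which lists the available schedulers
with the active one in square brackets.  The helper names are
hypothetical, and writing to the file needs root.

```python
from pathlib import Path


def parse_scheduler_line(text):
    """Split e.g. 'noop anticipatory deadline [cfq]' into
    (list of available schedulers, name of the active one)."""
    available, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active scheduler
            active = token
        available.append(token)
    return available, active


def scheduler_path(device):
    """Path of the sysfs scheduler file for a block device such as 'sda'."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def current_scheduler(device):
    """Read and parse the scheduler file of one device."""
    return parse_scheduler_line(scheduler_path(device).read_text())


def set_scheduler(device, name):
    """Select a scheduler by writing its name to sysfs (requires root)."""
    scheduler_path(device).write_text(name)
```

On the box described here that would mean calling
set_scheduler(dev, "anticipatory") for each of sda, sdb, sdc and sdd,
since the scheduler is chosen per underlying disk.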

So the behaviour with cfq is:
Disk light seems to be constantly on if there is any disk activity.  iotop
can show a total io of maybe 1MB/s and the disk light is on constantly.
Rarely does it seem to make it over about 15MB/s of total io.  Running a
sync command in the hopes that it will get a clue usually takes 20 to
30 seconds to complete.  Running it again right after it completes can
take another 5 seconds if something is recording at the time.  That seems
like a long time to handle a few seconds worth of MPEG2 NTSC video.
Starting playback on mythtv most often fails on the first attempt with a
15 second timeout, and then on the second attempt it will usually manage
to start playback.  Sometimes it takes a 3rd attempt.

The behaviour with anticipatory is:
Disk light flashes every second or two for a moment, iotop shows much
faster completion of io, and I have even seen 75MB/s worth of IO when
restarting the bittornado client and having it hash check the data
of files.  With cfq that never seemed to get over about 15MB/s and the
system would become unusable while doing it.  mythtv responsiveness is
instant like it used to be in the past.  Running sync returns practically
immediately.  Maybe 1 second in the worst case (which was with 2 shows
recording, 2 transcoding, and bittorrent hash checking).
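
The sync timings quoted for both schedulers were taken by hand.  A rough
sketch of how the same back-to-back measurement could be repeated under
each scheduler follows; it is an illustration only, not how the numbers
above were obtained, and the function names are hypothetical.

```python
import os
import time


def time_call(func, repeats=2):
    """Call func() `repeats` times back to back and return the
    wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.monotonic()
        func()
        durations.append(time.monotonic() - start)
    return durations


def time_sync(repeats=2):
    """Time consecutive sync() calls, mirroring 'run sync, then run it
    again right after it completes'."""
    return time_call(os.sync, repeats)


def format_report(durations):
    """Turn a list of durations into printable lines."""
    return [f"sync #{i}: {seconds:.2f}s"
            for i, seconds in enumerate(durations, start=1)]
```

Printing format_report(time_sync()) while a recording is in progress,
once per scheduler, would give numbers directly comparable to the ones
described above.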

Anyhow here is dmesg from bootup.  The only thing I have seen in it since
boot (I unfortunately cleared it yesterday to see if any more messages
were happening since the change) are messages about mpeg2 data dropped
by ivtv because the system wasn't keeping up.  I haven't seen any of
those messages since the switch.  They were happening a lot before.

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.26-1-amd64 (Debian 2.6.26-13) (waldi@debian.org) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-24)) #1 SMP Sat Jan 10 17:57:00 UTC 2009
[    0.000000] Command line: root=/dev/md0 ro quiet 
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009ec00 (usable)
[    0.000000]  BIOS-e820: 000000000009ec00 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000cff80000 (usable)
[    0.000000]  BIOS-e820: 00000000cff80000 - 00000000cff8e000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000cff8e000 - 00000000cffe0000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000cffe0000 - 00000000d0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 00000001b0000000 (usable)
[    0.000000] Entering add_active_range(0, 0, 158) 0 entries of 3200 used
[    0.000000] Entering add_active_range(0, 256, 851840) 1 entries of 3200 used
[    0.000000] Entering add_active_range(0, 1048576, 1769472) 2 entries of 3200 used
[    0.000000] max_pfn_mapped = 1769472
[    0.000000] init_memory_mapping
[    0.000000] DMI 2.4 present.
[    0.000000] ACPI: RSDP 000FBCD0, 0024 (r2 ACPIAM)
[    0.000000] ACPI: XSDT CFF80100, 0054 (r1 A_M_I_ OEMXSDT   3000805 MSFT       97)
[    0.000000] ACPI: FACP CFF80290, 00F4 (r3 A_M_I_ OEMFACP   3000805 MSFT       97)
[    0.000000] ACPI: DSDT CFF80440, 90AB (r1  A0871 A0871018       18 INTL 20060113)
[    0.000000] ACPI: FACS CFF8E000, 0040
[    0.000000] ACPI: APIC CFF80390, 006C (r1 A_M_I_ OEMAPIC   3000805 MSFT       97)
[    0.000000] ACPI: MCFG CFF80400, 003C (r1 A_M_I_ OEMMCFG   3000805 MSFT       97)
[    0.000000] ACPI: OEMB CFF8E040, 0081 (r1 A_M_I_ AMI_OEM   3000805 MSFT       97)
[    0.000000] ACPI: HPET CFF894F0, 0038 (r1 A_M_I_ OEMHPET   3000805 MSFT       97)
[    0.000000] ACPI: OSFR CFF89530, 00B0 (r1 A_M_I_ OEMOSFR   3000805 MSFT       97)
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at 0000000000000000-00000001b0000000
[    0.000000] Entering add_active_range(0, 0, 158) 0 entries of 3200 used
[    0.000000] Entering add_active_range(0, 256, 851840) 1 entries of 3200 used
[    0.000000] Entering add_active_range(0, 1048576, 1769472) 2 entries of 3200 used
[    0.000000] Bootmem setup node 0 0000000000000000-00000001b0000000
[    0.000000]   NODE_DATA [0000000000010000 - 0000000000014fff]
[    0.000000]   bootmap [0000000000015000 -  000000000004afff] pages 36
[    0.000000]   early res: 0 [0-fff] BIOS data page
[    0.000000]   early res: 1 [6000-7fff] TRAMPOLINE
[    0.000000]   early res: 2 [200000-675397] TEXT DATA BSS
[    0.000000]   early res: 3 [37775000-37fef581] RAMDISK
[    0.000000]   early res: 4 [9ec00-fffff] BIOS reserved
[    0.000000]   early res: 5 [8000-ffff] PGTABLE
[    0.000000]  [ffffe20000000000-ffffe20002dfffff] PMD -> [ffff810001200000-ffff810003ffffff] on node 0
[    0.000000]  [ffffe20003800000-ffffe20005ffffff] PMD -> [ffff81000c000000-ffff81000e7fffff] on node 0
[    0.000000] Zone PFN ranges:
[    0.000000]   DMA             0 ->     4096
[    0.000000]   DMA32        4096 ->  1048576
[    0.000000]   Normal    1048576 ->  1769472
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[3] active PFN ranges
[    0.000000]     0:        0 ->      158
[    0.000000]     0:      256 ->   851840
[    0.000000]     0:  1048576 ->  1769472
[    0.000000] On node 0 totalpages: 1572638
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 1251 pages reserved
[    0.000000]   DMA zone: 2691 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 14280 pages used for memmap
[    0.000000]   DMA32 zone: 833464 pages, LIFO batch:31
[    0.000000]   Normal zone: 9856 pages used for memmap
[    0.000000]   Normal zone: 711040 pages, LIFO batch:31
[    0.000000]   Movable zone: 0 pages used for memmap
[    0.000000] ACPI: PM-Timer IO Port: 0x808
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
[    0.000000] ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 4, version 0, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Setting APIC routing to flat
[    0.000000] ACPI: HPET id: 0xffffffff base: 0xfed00000
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] PM: Registered nosave memory: 000000000009e000 - 000000000009f000
[    0.000000] PM: Registered nosave memory: 000000000009f000 - 00000000000a0000
[    0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000e4000
[    0.000000] PM: Registered nosave memory: 00000000000e4000 - 0000000000100000
[    0.000000] PM: Registered nosave memory: 00000000cff80000 - 00000000cff8e000
[    0.000000] PM: Registered nosave memory: 00000000cff8e000 - 00000000cffe0000
[    0.000000] PM: Registered nosave memory: 00000000cffe0000 - 00000000d0000000
[    0.000000] PM: Registered nosave memory: 00000000d0000000 - 00000000fee00000
[    0.000000] PM: Registered nosave memory: 00000000fee00000 - 00000000fee01000
[    0.000000] PM: Registered nosave memory: 00000000fee01000 - 00000000fff00000
[    0.000000] PM: Registered nosave memory: 00000000fff00000 - 0000000100000000
[    0.000000] Allocating PCI resources starting at d4000000 (gap: d0000000:2ee00000)
[    0.000000] SMP: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] PERCPU: Allocating 37168 bytes of per cpu data
[    0.000000] NR_CPUS: 32, nr_cpu_ids: 4
[    0.000000] Built 1 zonelists in Node order, mobility grouping on.  Total pages: 1547195
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: root=/dev/md0 ro quiet 
[    0.000000] Initializing CPU#0
[    0.000000] PID hash table entries: 4096 (order: 12, 32768 bytes)
[    0.000000] Extended CMOS year: 2000
[    0.000000] TSC calibrated against PM_TIMER
[    0.000000] time.c: Detected 2405.452 MHz processor.
[    0.004000] Console: colour VGA+ 80x25
[    0.004000] console [tty0] enabled
[    0.004000] Checking aperture...
[    0.004000] Calgary: detecting Calgary via BIOS EBDA area
[    0.004000] Calgary: Unable to locate Rio Grande table in EBDA - bailing!
[    0.004000] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    0.004000] Placing software IO TLB between 0x4000000 - 0x8000000
[    0.004000] Memory: 6122760k/7077888k available (2226k kernel code, 167792k reserved, 1082k data, 392k init)
[    0.004000] CPA: page pool initialized 1 of 1 pages preallocated
[    0.004000] hpet clockevent registered
[    0.083783] Calibrating delay using timer specific routine.. 4895.00 BogoMIPS (lpj=9790007)
[    0.083821] Security Framework initialized
[    0.083826] SELinux:  Disabled at boot.
[    0.083829] Capability LSM initialized
[    0.084005] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
[    0.088005] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
[    0.088005] Mount-cache hash table entries: 256
[    0.088185] Initializing cgroup subsys ns
[    0.088190] Initializing cgroup subsys cpuacct
[    0.088192] Initializing cgroup subsys devices
[    0.088211] CPU: L1 I cache: 32K, L1 D cache: 32K
[    0.088213] CPU: L2 cache: 4096K
[    0.088215] CPU 0/0 -> Node 0
[    0.088216] CPU: Physical Processor ID: 0
[    0.088217] CPU: Processor Core ID: 0
[    0.088224] CPU0: Thermal monitoring enabled (TM2)
[    0.088225] using mwait in idle threads.
[    0.089074] ACPI: Core revision 20080321
[    0.148009] CPU0: Intel(R) Core(TM)2 Quad CPU           @ 2.40GHz stepping 07
[    0.148009] Using local APIC timer interrupts.
[    0.152009] APIC timer calibration result 16704557
[    0.152009] Detected 16.704 MHz APIC timer.
[    0.152009] Booting processor 1/1 ip 6000
[    0.160010] Initializing CPU#1
[    0.160010] Calibrating delay using timer specific routine.. 4767.29 BogoMIPS (lpj=9534596)
[    0.160010] CPU: L1 I cache: 32K, L1 D cache: 32K
[    0.160010] CPU: L2 cache: 4096K
[    0.160010] CPU 1/1 -> Node 0
[    0.160010] CPU: Physical Processor ID: 0
[    0.160010] CPU: Processor Core ID: 1
[    0.160010] CPU1: Thermal monitoring enabled (TM2)
[    0.240015] CPU1: Intel(R) Core(TM)2 Quad CPU           @ 2.40GHz stepping 07
[    0.240015] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
[    0.244015] Booting processor 2/2 ip 6000
[    0.252015] Initializing CPU#2
[    0.252015] Calibrating delay using timer specific routine.. 4811.00 BogoMIPS (lpj=9622005)
[    0.252015] CPU: L1 I cache: 32K, L1 D cache: 32K
[    0.252015] CPU: L2 cache: 4096K
[    0.252015] CPU 2/2 -> Node 0
[    0.252015] CPU: Physical Processor ID: 0
[    0.252015] CPU: Processor Core ID: 2
[    0.252015] CPU2: Thermal monitoring enabled (TM2)
[    0.331645] CPU2: Intel(R) Core(TM)2 Quad CPU           @ 2.40GHz stepping 07
[    0.331664] checking TSC synchronization [CPU#0 -> CPU#2]: passed.
[    0.336021] Booting processor 3/3 ip 6000
[    0.347625] Initializing CPU#3
[    0.347625] Calibrating delay using timer specific routine.. 4853.64 BogoMIPS (lpj=9707299)
[    0.347625] CPU: L1 I cache: 32K, L1 D cache: 32K
[    0.347625] CPU: L2 cache: 4096K
[    0.347625] CPU 3/3 -> Node 0
[    0.347625] CPU: Physical Processor ID: 0
[    0.347625] CPU: Processor Core ID: 3
[    0.347625] CPU3: Thermal monitoring enabled (TM2)
[    0.424043] CPU3: Intel(R) Core(TM)2 Quad CPU           @ 2.40GHz stepping 07
[    0.424061] checking TSC synchronization [CPU#0 -> CPU#3]: passed.
[    0.428290] Brought up 4 CPUs
[    0.428290] Total of 4 processors activated (19326.95 BogoMIPS).
[    0.428290] CPU0 attaching sched-domain:
[    0.428290]  domain 0: span 0-1
[    0.428290]   groups: 0 1
[    0.428290]   domain 1: span 0-3
[    0.428290]    groups: 0-1 2-3
[    0.428290]    domain 2: span 0-3
[    0.428290]     groups: 0-3
[    0.428290] CPU1 attaching sched-domain:
[    0.428290]  domain 0: span 0-1
[    0.428290]   groups: 1 0
[    0.428290]   domain 1: span 0-3
[    0.428290]    groups: 0-1 2-3
[    0.428290]    domain 2: span 0-3
[    0.428290]     groups: 0-3
[    0.428290] CPU2 attaching sched-domain:
[    0.428290]  domain 0: span 2-3
[    0.428290]   groups: 2 3
[    0.428290]   domain 1: span 0-3
[    0.428290]    groups: 2-3 0-1
[    0.428290]    domain 2: span 0-3
[    0.428290]     groups: 0-3
[    0.428290] CPU3 attaching sched-domain:
[    0.428290]  domain 0: span 2-3
[    0.428290]   groups: 3 2
[    0.428290]   domain 1: span 0-3
[    0.428290]    groups: 2-3 0-1
[    0.428290]    domain 2: span 0-3
[    0.428290]     groups: 0-3
[    0.428290] net_namespace: 1224 bytes
[    0.428290] Booting paravirtualized kernel on bare hardware
[    0.428290] NET: Registered protocol family 16
[    0.428290] ACPI: bus type pci registered
[    0.428290] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[    0.428290] PCI: Not using MMCONFIG.
[    0.428290] PCI: Using configuration type 1 for base access
[    0.428290] ACPI: EC: Look up EC in DSDT
[    0.442132] ACPI: Interpreter enabled
[    0.442134] ACPI: (supports S0 S1 S3 S4 S5)
[    0.442171] ACPI: Using IOAPIC for interrupt routing
[    0.442223] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[    0.445747] PCI: MCFG area at e0000000 reserved in ACPI motherboard resources
[    0.457787] PCI: Using MMCONFIG at e0000000 - efffffff
[    0.468107] ACPI: PCI Root Bridge [PCI0] (0000:00)
[    0.468851] pci 0000:00:1f.0: quirk: region 0800-087f claimed by ICH6 ACPI/GPIO/TCO
[    0.468855] pci 0000:00:1f.0: quirk: region 0480-04bf claimed by ICH6 GPIO
[    0.470581] PCI: Transparent bridge - 0000:00:1e.0
[    0.470809] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[    0.471412] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P2._PRT]
[    0.471519] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P1._PRT]
[    0.471717] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P8._PRT]
[    0.471826] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P9._PRT]
[    0.471946] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P4._PRT]
[    0.489478] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 10 *11 12 14 15)
[    0.489649] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *10 11 12 14 15)
[    0.489793] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 7 10 11 12 14 15)
[    0.489907] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 *4 5 6 7 10 11 12 14 15)
[    0.490021] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
[    0.490135] ACPI: PCI Interrupt Link [LNKF] (IRQs *3 4 5 6 7 10 11 12 14 15)
[    0.491031] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 11 12 14 *15)
[    0.491145] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 *7 10 11 12 14 15)
[    0.494260] Linux Plug and Play Support v0.97 (c) Adam Belay
[    0.494260] pnp: PnP ACPI init
[    0.494260] ACPI: bus type pnp registered
[    0.494260] pnp 00:00: parse allocated resources
[    0.494260] pnp 00:00:   add io  0xcf8-0xcff flags 0x1
[    0.494260] pnp 00:00: Plug and Play ACPI device, IDs PNP0a08 PNP0a03 (active)
[    0.494260] pnp 00:01: parse allocated resources
[    0.494260] pnp 00:01:   add mem 0xfed14000-0xfed19fff flags 0x1
[    0.494260] pnp 00:01: PNP0c01: calling quirk_system_pci_resources+0x0/0x15c
[    0.494260] pnp 00:01: Plug and Play ACPI device, IDs PNP0c01 (active)
[    0.494260] pnp 00:02: parse allocated resources
[    0.494260] pnp 00:02:   add dma 4 flags 0x4
[    0.494260] pnp 00:02:   add io  0x0-0xf flags 0x1
[    0.494260] pnp 00:02:   add io  0x81-0x83 flags 0x1
[    0.494260] pnp 00:02:   add io  0x87-0x87 flags 0x1
[    0.494260] pnp 00:02:   add io  0x89-0x8b flags 0x1
[    0.494260] pnp 00:02:   add io  0x8f-0x8f flags 0x1
[    0.494260] pnp 00:02:   add io  0xc0-0xdf flags 0x1
[    0.494260] pnp 00:02: Plug and Play ACPI device, IDs PNP0200 (active)
[    0.494260] pnp 00:03: parse allocated resources
[    0.494260] pnp 00:03:   add io  0x70-0x71 flags 0x1
[    0.494260] pnp 00:03:   add irq 8 flags 0x1
[    0.494260] pnp 00:03: Plug and Play ACPI device, IDs PNP0b00 (active)
[    0.494260] pnp 00:04: parse allocated resources
[    0.494260] pnp 00:04:   add io  0x61-0x61 flags 0x1
[    0.494260] pnp 00:04: Plug and Play ACPI device, IDs PNP0800 (active)
[    0.494260] pnp 00:05: parse allocated resources
[    0.494260] pnp 00:05:   add io  0xf0-0xff flags 0x1
[    0.494260] pnp 00:05:   add irq 13 flags 0x1
[    0.494260] pnp 00:05: Plug and Play ACPI device, IDs PNP0c04 (active)
[    0.494260] pnp 00:06: parse allocated resources
[    0.494260] pnp 00:06:   add io  0x0-0xffffffffffffffff flags 0x10000001
[    0.494260] pnp 00:06:   add io  0x0-0xffffffffffffffff flags 0x10000001
[    0.494260] pnp 00:06:   add io  0x290-0x297 flags 0x1
[    0.494260] pnp 00:06: PNP0c02: calling quirk_system_pci_resources+0x0/0x15c
[    0.494260] pnp 00:06: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.494260] pnp 00:07: parse allocated resources
[    0.494260] pnp 00:07:   add io  0x10-0x1f flags 0x1
[    0.494260] pnp 00:07:   add io  0x22-0x3f flags 0x1
[    0.494260] pnp 00:07:   add io  0x44-0x4d flags 0x1
[    0.494260] pnp 00:07:   add io  0x50-0x5f flags 0x1
[    0.494260] pnp 00:07:   add io  0x62-0x63 flags 0x1
[    0.494260] pnp 00:07:   add io  0x65-0x6f flags 0x1
[    0.494260] pnp 00:07:   add io  0x72-0x7f flags 0x1
[    0.494260] pnp 00:07:   add io  0x80-0x80 flags 0x1
[    0.494260] pnp 00:07:   add io  0x84-0x86 flags 0x1
[    0.494260] pnp 00:07:   add io  0x88-0x88 flags 0x1
[    0.494260] pnp 00:07:   add io  0x8c-0x8e flags 0x1
[    0.494260] pnp 00:07:   add io  0x90-0x9f flags 0x1
[    0.494260] pnp 00:07:   add io  0xa2-0xbf flags 0x1
[    0.494260] pnp 00:07:   add io  0xe0-0xef flags 0x1
[    0.494260] pnp 00:07:   add io  0x4d0-0x4d1 flags 0x1
[    0.494260] pnp 00:07:   add io  0x800-0x87f flags 0x1
[    0.494260] pnp 00:07:   add io  0x400-0x3ff flags 0x10000001
[    0.494260] pnp 00:07:   add io  0x480-0x4bf flags 0x1
[    0.494260] pnp 00:07:   add mem 0xfed1c000-0xfed1ffff flags 0x1
[    0.494260] pnp 00:07:   add mem 0xfed20000-0xfed3ffff flags 0x1
[    0.494260] pnp 00:07:   add mem 0xfed50000-0xfed8ffff flags 0x1
[    0.494260] pnp 00:07:   add mem 0xffa00000-0xffafffff flags 0x1
[    0.494260] pnp 00:07:   add mem 0xffb00000-0xffbfffff flags 0x1
[    0.494260] pnp 00:07:   add mem 0xffe00000-0xffefffff flags 0x1
[    0.494260] pnp 00:07:   add mem 0xfff00000-0xfffffffe flags 0x1
[    0.494260] pnp 00:07: PNP0c02: calling quirk_system_pci_resources+0x0/0x15c
[    0.494260] pnp 00:07: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.494260] pnp 00:08: parse allocated resources
[    0.494260] pnp 00:08:   add mem 0xfed00000-0xfed003ff flags 0x0
[    0.494260] pnp 00:08: Plug and Play ACPI device, IDs PNP0103 (active)
[    0.494260] pnp 00:09: parse allocated resources
[    0.494260] pnp 00:09:   add mem 0xfec00000-0xfec00fff flags 0x0
[    0.494260] pnp 00:09:   add mem 0xfee00000-0xfee00fff flags 0x0
[    0.494260] pnp 00:09: PNP0c02: calling quirk_system_pci_resources+0x0/0x15c
[    0.494260] pnp 00:09: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.494260] pnp 00:0a: parse allocated resources
[    0.494260] pnp 00:0a:   add io  0x60-0x60 flags 0x1
[    0.494260] pnp 00:0a:   add io  0x64-0x64 flags 0x1
[    0.494260] pnp 00:0a:   add irq 1 flags 0x1
[    0.494260] pnp 00:0a: Plug and Play ACPI device, IDs PNP0303 PNP030b (active)
[    0.494260] pnp 00:0b: parse allocated resources
[    0.494260] pnp 00:0b:   add mem 0xe0000000-0xefffffff flags 0x0
[    0.494260] pnp 00:0b: PNP0c02: calling quirk_system_pci_resources+0x0/0x15c
[    0.494260] pnp 00:0b: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.494260] pnp 00:0c: parse allocated resources
[    0.494260] pnp 00:0c:   add mem 0x0-0x9ffff flags 0x1
[    0.494260] pnp 00:0c:   add mem 0xc0000-0xcffff flags 0x0
[    0.494260] pnp 00:0c:   add mem 0xe0000-0xfffff flags 0x0
[    0.494260] pnp 00:0c:   add mem 0x100000-0xcfffffff flags 0x1
[    0.494260] pnp 00:0c:   add mem 0x0-0xffffffffffffffff flags 0x10000000
[    0.494260] pnp 00:0c: PNP0c01: calling quirk_system_pci_resources+0x0/0x15c
[    0.494260] pnp 00:0c: Plug and Play ACPI device, IDs PNP0c01 (active)
[    0.494580] pnp: PnP ACPI: found 13 devices
[    0.494582] ACPI: ACPI bus type pnp unregistered
[    0.498261] usbcore: registered new interface driver usbfs
[    0.498261] usbcore: registered new interface driver hub
[    0.498261] usbcore: registered new device driver usb
[    0.498261] PCI: Using ACPI for IRQ routing
[    0.517244] PCI-GART: No AMD northbridge found.
[    0.517250] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0
[    0.517255] hpet0: 4 64-bit timers, 14318180 Hz
[    0.518262] ACPI: RTC can wake from S4
[    0.521171] Switched to high resolution mode on CPU 0
[    0.521836] Switched to high resolution mode on CPU 1
[    0.523590] Switched to high resolution mode on CPU 2
[    0.524768] Switched to high resolution mode on CPU 3
[    0.529149] pnp: the driver 'system' has been registered
[    0.529161] system 00:01: iomem range 0xfed14000-0xfed19fff has been reserved
[    0.529164] system 00:01: driver attached
[    0.529173] system 00:06: ioport range 0x290-0x297 has been reserved
[    0.529176] system 00:06: driver attached
[    0.529182] system 00:07: ioport range 0x4d0-0x4d1 has been reserved
[    0.529185] system 00:07: ioport range 0x800-0x87f has been reserved
[    0.529188] system 00:07: ioport range 0x480-0x4bf has been reserved
[    0.529192] system 00:07: iomem range 0xfed1c000-0xfed1ffff has been reserved
[    0.529196] system 00:07: iomem range 0xfed20000-0xfed3ffff has been reserved
[    0.529199] system 00:07: iomem range 0xfed50000-0xfed8ffff has been reserved
[    0.529202] system 00:07: iomem range 0xffa00000-0xffafffff has been reserved
[    0.529206] system 00:07: iomem range 0xffb00000-0xffbfffff has been reserved
[    0.529209] system 00:07: iomem range 0xffe00000-0xffefffff has been reserved
[    0.529213] system 00:07: iomem range 0xfff00000-0xfffffffe could not be reserved
[    0.529215] system 00:07: driver attached
[    0.529222] system 00:09: iomem range 0xfec00000-0xfec00fff has been reserved
[    0.529226] system 00:09: iomem range 0xfee00000-0xfee00fff could not be reserved
[    0.529229] system 00:09: driver attached
[    0.529236] system 00:0b: iomem range 0xe0000000-0xefffffff could not be reserved
[    0.529239] system 00:0b: driver attached
[    0.529245] system 00:0c: iomem range 0x0-0x9ffff could not be reserved
[    0.529249] system 00:0c: iomem range 0xc0000-0xcffff has been reserved
[    0.529252] system 00:0c: iomem range 0xe0000-0xfffff could not be reserved
[    0.529255] system 00:0c: iomem range 0x100000-0xcfffffff could not be reserved
[    0.529258] system 00:0c: driver attached
[    0.530253] PCI: Bridge: 0000:00:01.0
[    0.530253]   IO window: c000-cfff
[    0.530253]   MEM window: 0xfa000000-0xfe8fffff
[    0.530253]   PREFETCH window: 0x00000000d0000000-0x00000000dfffffff
[    0.530253] PCI: Bridge: 0000:00:1c.0
[    0.530253]   IO window: disabled.
[    0.530253]   MEM window: disabled.
[    0.530253]   PREFETCH window: 0x00000000eff00000-0x00000000efffffff
[    0.530253] PCI: Bridge: 0000:00:1c.4
[    0.530253]   IO window: d000-dfff
[    0.530253]   MEM window: 0xfea00000-0xfeafffff
[    0.530253]   PREFETCH window: disabled.
[    0.530253] PCI: Bridge: 0000:00:1c.5
[    0.530253]   IO window: disabled.
[    0.530253]   MEM window: 0xfe900000-0xfe9fffff
[    0.530253]   PREFETCH window: disabled.
[    0.530253] PCI: Bridge: 0000:05:02.0
[    0.530253]   IO window: disabled.
[    0.530253]   MEM window: disabled.
[    0.530253]   PREFETCH window: 0x00000000f0000000-0x00000000f7ffffff
[    0.530253] PCI: Bridge: 0000:00:1e.0
[    0.530253]   IO window: e000-efff
[    0.530253]   MEM window: 0xfeb00000-0xfebfffff
[    0.530253]   PREFETCH window: 0x00000000f0000000-0x00000000f7ffffff
[    0.530253] ACPI: PCI Interrupt 0000:00:01.0[A] -> GSI 16 (level, low) -> IRQ 16
[    0.530253] PCI: Setting latency timer of device 0000:00:01.0 to 64
[    0.530253] ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 17 (level, low) -> IRQ 17
[    0.530253] PCI: Setting latency timer of device 0000:00:1c.0 to 64
[    0.530253] ACPI: PCI Interrupt 0000:00:1c.4[A] -> GSI 17 (level, low) -> IRQ 17
[    0.530253] PCI: Setting latency timer of device 0000:00:1c.4 to 64
[    0.530253] ACPI: PCI Interrupt 0000:00:1c.5[B] -> GSI 16 (level, low) -> IRQ 16
[    0.530253] PCI: Setting latency timer of device 0000:00:1c.5 to 64
[    0.530253] PCI: Setting latency timer of device 0000:00:1e.0 to 64
[    0.530253] NET: Registered protocol family 2
[    0.577123] IP route cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.577123] TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
[    0.581992] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[    0.581992] TCP: Hash tables configured (established 524288 bind 65536)
[    0.581992] TCP reno registered
[    0.595229] NET: Registered protocol family 1
[    0.595229] checking if image is initramfs... it is
[    1.187643] Freeing initrd memory: 8681k freed
[    1.199197] audit: initializing netlink socket (disabled)
[    1.199197] type=2000 audit(1234114909.180:1): initialized
[    1.199326] Total HugeTLB memory allocated, 0
[    1.199326] VFS: Disk quotas dquot_6.5.1
[    1.199326] Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    1.199326] msgmni has been set to 11975
[    1.199326] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
[    1.199326] io scheduler noop registered
[    1.199326] io scheduler anticipatory registered
[    1.199326] io scheduler deadline registered
[    1.199326] io scheduler cfq registered (default)
[    1.199326] pci 0000:01:00.0: Boot video device
[    1.199326] PCI: Setting latency timer of device 0000:00:01.0 to 64
[    1.199326] assign_interrupt_mode Found MSI capability
[    1.199326] Allocate Port Service[0000:00:01.0:pcie00]
[    1.199326] Allocate Port Service[0000:00:01.0:pcie03]
[    1.199326] PCI: Setting latency timer of device 0000:00:1c.0 to 64
[    1.199326] assign_interrupt_mode Found MSI capability
[    1.199326] Allocate Port Service[0000:00:1c.0:pcie00]
[    1.199326] Allocate Port Service[0000:00:1c.0:pcie02]
[    1.199326] Allocate Port Service[0000:00:1c.0:pcie03]
[    1.199326] PCI: Setting latency timer of device 0000:00:1c.4 to 64
[    1.199326] assign_interrupt_mode Found MSI capability
[    1.199326] Allocate Port Service[0000:00:1c.4:pcie00]
[    1.199326] Allocate Port Service[0000:00:1c.4:pcie02]
[    1.199326] Allocate Port Service[0000:00:1c.4:pcie03]
[    1.199326] PCI: Setting latency timer of device 0000:00:1c.5 to 64
[    1.199326] assign_interrupt_mode Found MSI capability
[    1.199326] Allocate Port Service[0000:00:1c.5:pcie00]
[    1.199326] Allocate Port Service[0000:00:1c.5:pcie02]
[    1.199326] Allocate Port Service[0000:00:1c.5:pcie03]
[    1.376076] hpet_resources: 0xfed00000 is busy
[    1.376076] Linux agpgart interface v0.103
[    1.376076] Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
[    1.376076] pnp: the driver 'serial' has been registered
[    1.376077] brd: module loaded
[    1.376077] input: Macintosh mouse button emulation as /class/input/input0
[    1.376077] pnp: the driver 'i8042 kbd' has been registered
[    1.376077] i8042 kbd 00:0a: driver attached
[    1.376077] pnp: the driver 'i8042 aux' has been registered
[    1.376077] PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
[    1.376077] PNP: PS/2 appears to have AUX port disabled, if this is incorrect please boot with i8042.nopnp
[    1.376077] serio: i8042 KBD port at 0x60,0x64 irq 1
[    1.399968] mice: PS/2 mouse device common for all mice
[    1.399968] pnp: the driver 'rtc_cmos' has been registered
[    1.399968] rtc_cmos 00:03: rtc core: registered rtc_cmos as rtc0
[    1.399968] rtc0: alarms up to one month, y3k
[    1.399968] rtc_cmos 00:03: driver attached
[    1.399968] cpuidle: using governor ladder
[    1.399968] cpuidle: using governor menu
[    1.399968] No iBFT detected.
[    1.399968] TCP cubic registered
[    1.399968] NET: Registered protocol family 17
[    1.399968] registered taskstats version 1
[    1.399968] rtc_cmos 00:03: setting system clock to 2009-02-08 17:41:49 UTC (1234114909)
[    1.399968] Freeing unused kernel memory: 392k freed
[    1.424678] input: AT Translated Set 2 keyboard as /class/input/input1
[    1.477565] ACPI: SSDT CFF8E0D0, 01D2 (r1    AMI   CPU1PM        1 INTL 20060113)
[    1.480491] ACPI: ACPI0007:00 is registered as cooling_device0
[    1.480491] ACPI: SSDT CFF8E2B0, 0143 (r1    AMI   CPU2PM        1 INTL 20060113)
[    1.480491] ACPI: ACPI0007:01 is registered as cooling_device1
[    1.480491] ACPI: SSDT CFF8E400, 0143 (r1    AMI   CPU3PM        1 INTL 20060113)
[    1.480491] ACPI: ACPI0007:02 is registered as cooling_device2
[    1.480491] ACPI: SSDT CFF8E550, 0143 (r1    AMI   CPU4PM        1 INTL 20060113)
[    1.480491] ACPI: ACPI0007:03 is registered as cooling_device3
[    1.596989] USB Universal Host Controller Interface driver v3.0
[    1.596989] ACPI: PCI Interrupt 0000:00:1a.0[A] -> GSI 16 (level, low) -> IRQ 16
[    1.596989] PCI: Setting latency timer of device 0000:00:1a.0 to 64
[    1.596989] uhci_hcd 0000:00:1a.0: UHCI Host Controller
[    1.596989] uhci_hcd 0000:00:1a.0: new USB bus registered, assigned bus number 1
[    1.596989] uhci_hcd 0000:00:1a.0: irq 16, io base 0x0000b800
[    1.596989] usb usb1: configuration #1 chosen from 1 choice
[    1.596989] hub 1-0:1.0: USB hub found
[    1.596989] hub 1-0:1.0: 2 ports detected
[    1.653713] Uniform Multi-Platform E-IDE driver
[    1.653713] ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
[    1.653713] No dock devices found.
[    1.657713] SCSI subsystem initialized
[    1.661847] libata version 3.00 loaded.
[    1.704762] usb usb1: New USB device found, idVendor=1d6b, idProduct=0001
[    1.704765] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.704767] usb usb1: Product: UHCI Host Controller
[    1.704769] usb usb1: Manufacturer: Linux 2.6.26-1-amd64 uhci_hcd
[    1.704770] usb usb1: SerialNumber: 0000:00:1a.0
[    1.706649] ACPI: PCI Interrupt 0000:00:1a.1[B] -> GSI 21 (level, low) -> IRQ 21
[    1.706649] PCI: Setting latency timer of device 0000:00:1a.1 to 64
[    1.706649] uhci_hcd 0000:00:1a.1: UHCI Host Controller
[    1.706649] uhci_hcd 0000:00:1a.1: new USB bus registered, assigned bus number 2
[    1.706649] uhci_hcd 0000:00:1a.1: irq 21, io base 0x0000b880
[    1.706649] usb usb2: configuration #1 chosen from 1 choice
[    1.706649] hub 2-0:1.0: USB hub found
[    1.706649] hub 2-0:1.0: 2 ports detected
[    1.812149] usb usb2: New USB device found, idVendor=1d6b, idProduct=0001
[    1.812153] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.812156] usb usb2: Product: UHCI Host Controller
[    1.812158] usb usb2: Manufacturer: Linux 2.6.26-1-amd64 uhci_hcd
[    1.812161] usb usb2: SerialNumber: 0000:00:1a.1
[    1.813962] ACPI: PCI Interrupt 0000:00:1a.2[C] -> GSI 18 (level, low) -> IRQ 18
[    1.813962] PCI: Setting latency timer of device 0000:00:1a.2 to 64
[    1.813962] uhci_hcd 0000:00:1a.2: UHCI Host Controller
[    1.813962] uhci_hcd 0000:00:1a.2: new USB bus registered, assigned bus number 3
[    1.813962] uhci_hcd 0000:00:1a.2: irq 18, io base 0x0000bc00
[    1.813962] usb usb3: configuration #1 chosen from 1 choice
[    1.813962] hub 3-0:1.0: USB hub found
[    1.813962] hub 3-0:1.0: 2 ports detected
[    1.944886] usb usb3: New USB device found, idVendor=1d6b, idProduct=0001
[    1.944889] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.944892] usb usb3: Product: UHCI Host Controller
[    1.944894] usb usb3: Manufacturer: Linux 2.6.26-1-amd64 uhci_hcd
[    1.944896] usb usb3: SerialNumber: 0000:00:1a.2
[    1.949961] ACPI: PCI Interrupt 0000:00:1a.7[C] -> GSI 18 (level, low) -> IRQ 18
[    1.949961] PCI: Setting latency timer of device 0000:00:1a.7 to 64
[    1.949961] ehci_hcd 0000:00:1a.7: EHCI Host Controller
[    1.949961] ehci_hcd 0000:00:1a.7: new USB bus registered, assigned bus number 4
[    1.957946] ehci_hcd 0000:00:1a.7: debug port 1
[    1.957946] PCI: cache line size of 32 is not supported by device 0000:00:1a.7
[    1.957946] ehci_hcd 0000:00:1a.7: irq 18, io mem 0xf9fffc00
[    1.969966] ehci_hcd 0000:00:1a.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
[    1.969966] usb usb4: configuration #1 chosen from 1 choice
[    1.969966] hub 4-0:1.0: USB hub found
[    1.969966] hub 4-0:1.0: 6 ports detected
[    2.120490] usb usb4: New USB device found, idVendor=1d6b, idProduct=0002
[    2.120494] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    2.120497] usb usb4: Product: EHCI Host Controller
[    2.120499] usb usb4: Manufacturer: Linux 2.6.26-1-amd64 ehci_hcd
[    2.120501] usb usb4: SerialNumber: 0000:00:1a.7
[    2.122100] ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 23 (level, low) -> IRQ 23
[    2.122100] PCI: Setting latency timer of device 0000:00:1d.0 to 64
[    2.122100] uhci_hcd 0000:00:1d.0: UHCI Host Controller
[    2.122100] uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 5
[    2.122100] uhci_hcd 0000:00:1d.0: irq 23, io base 0x0000b080
[    2.122100] usb usb5: configuration #1 chosen from 1 choice
[    2.122100] hub 5-0:1.0: USB hub found
[    2.122100] hub 5-0:1.0: 2 ports detected
[    2.248513] usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
[    2.248516] usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    2.248519] usb usb5: Product: UHCI Host Controller
[    2.248521] usb usb5: Manufacturer: Linux 2.6.26-1-amd64 uhci_hcd
[    2.248524] usb usb5: SerialNumber: 0000:00:1d.0
[    2.250070] ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 19
[    2.250070] PCI: Setting latency timer of device 0000:00:1d.1 to 64
[    2.250070] uhci_hcd 0000:00:1d.1: UHCI Host Controller
[    2.250070] uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 6
[    2.250070] uhci_hcd 0000:00:1d.1: irq 19, io base 0x0000b400
[    2.250070] usb usb6: configuration #1 chosen from 1 choice
[    2.250070] hub 6-0:1.0: USB hub found
[    2.250070] hub 6-0:1.0: 2 ports detected
[    2.378035] usb usb6: New USB device found, idVendor=1d6b, idProduct=0001
[    2.378039] usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    2.378042] usb usb6: Product: UHCI Host Controller
[    2.378044] usb usb6: Manufacturer: Linux 2.6.26-1-amd64 uhci_hcd
[    2.378046] usb usb6: SerialNumber: 0000:00:1d.1
[    2.379606] ACPI: PCI Interrupt 0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 18
[    2.379606] PCI: Setting latency timer of device 0000:00:1d.2 to 64
[    2.379606] uhci_hcd 0000:00:1d.2: UHCI Host Controller
[    2.379606] uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 7
[    2.379606] uhci_hcd 0000:00:1d.2: irq 18, io base 0x0000b480
[    2.379606] usb usb7: configuration #1 chosen from 1 choice
[    2.379606] hub 7-0:1.0: USB hub found
[    2.379606] hub 7-0:1.0: 2 ports detected
[    2.515666] usb usb7: New USB device found, idVendor=1d6b, idProduct=0001
[    2.515669] usb usb7: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    2.515672] usb usb7: Product: UHCI Host Controller
[    2.515674] usb usb7: Manufacturer: Linux 2.6.26-1-amd64 uhci_hcd
[    2.515677] usb usb7: SerialNumber: 0000:00:1d.2
[    2.515688] ACPI: PCI Interrupt 0000:00:1d.7[A] -> GSI 23 (level, low) -> IRQ 23
[    2.515688] PCI: Setting latency timer of device 0000:00:1d.7 to 64
[    2.515688] ehci_hcd 0000:00:1d.7: EHCI Host Controller
[    2.515688] ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 8
[    2.531823] ehci_hcd 0000:00:1d.7: debug port 1
[    2.531823] PCI: cache line size of 32 is not supported by device 0000:00:1d.7
[    2.531823] ehci_hcd 0000:00:1d.7: irq 23, io mem 0xf9fff800
[    2.567648] ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
[    2.567648] usb usb8: configuration #1 chosen from 1 choice
[    2.567648] hub 8-0:1.0: USB hub found
[    2.567648] hub 8-0:1.0: 6 ports detected
[    2.671676] usb usb8: New USB device found, idVendor=1d6b, idProduct=0002
[    2.671679] usb usb8: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    2.671682] usb usb8: Product: EHCI Host Controller
[    2.671684] usb usb8: Manufacturer: Linux 2.6.26-1-amd64 ehci_hcd
[    2.671687] usb usb8: SerialNumber: 0000:00:1d.7
[    2.671863] ahci 0000:00:1f.2: version 3.0
[    2.671863] ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 22 (level, low) -> IRQ 22
[    2.834274] usb 2-1: new low speed USB device using uhci_hcd and address 2
[    2.968763] usb 2-1: configuration #1 chosen from 1 choice
[    3.234256] usb 2-1: New USB device found, idVendor=051d, idProduct=0002
[    3.234256] usb 2-1: New USB device strings: Mfr=3, Product=1, SerialNumber=2
[    3.234256] usb 2-1: Product: Smart-UPS 1500 FW:601.3.D USB FW:8.1
[    3.234256] usb 2-1: Manufacturer: American Power Conversion
[    3.234256] usb 2-1: SerialNumber: AS0719222574
[    3.897902] usbcore: registered new interface driver hiddev
[    3.907306] ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 4 ports 3 Gbps 0x33 impl SATA mode
[    3.907311] ahci 0000:00:1f.2: flags: 64bit ncq sntf stag pm led clo pmp pio slum part 
[    3.907316] PCI: Setting latency timer of device 0000:00:1f.2 to 64
[    3.911275] scsi0 : ahci
[    3.911275] scsi1 : ahci
[    3.911275] scsi2 : ahci
[    3.911275] scsi3 : ahci
[    3.911275] scsi4 : ahci
[    3.911275] scsi5 : ahci
[    3.911275] ata1: SATA max UDMA/133 abar m2048@0xf9ffe800 port 0xf9ffe900 irq 1275
[    3.911275] ata2: SATA max UDMA/133 abar m2048@0xf9ffe800 port 0xf9ffe980 irq 1275
[    3.911275] ata3: DUMMY
[    3.911275] ata4: DUMMY
[    3.911275] ata5: SATA max UDMA/133 abar m2048@0xf9ffe800 port 0xf9ffeb00 irq 1275
[    3.911275] ata6: SATA max UDMA/133 abar m2048@0xf9ffe800 port 0xf9ffeb80 irq 1275
[    4.522988] usb 7-1: new low speed USB device using uhci_hcd and address 2
[    4.837395] usb 7-1: configuration #1 chosen from 1 choice
[    4.837395] usb 7-1: New USB device found, idVendor=15c2, idProduct=ffdc
[    4.837395] usb 7-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    5.029494] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    5.041541] ata1.00: HPA detected: current 976771055, native 976773168
[    5.041541] ata1.00: ATA-7: WDC WD5000AAKS-00TMA0, 12.01C01, max UDMA/133
[    5.041541] ata1.00: 976771055 sectors, multi 0: LBA48 NCQ (depth 31/32)
[    5.041541] ata1.00: configured for UDMA/133
[    5.188625] usb 7-2: new low speed USB device using uhci_hcd and address 3
[    5.365313] usb 7-2: configuration #1 chosen from 1 choice
[    5.368267] usb 7-2: New USB device found, idVendor=0b38, idProduct=0010
[    5.368270] usb 7-2: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    5.871651] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    5.871651] ata2.00: ATA-8: WDC WD5000AAKS-00YGA0, 12.01C02, max UDMA/133
[    5.871651] ata2.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[    5.875683] ata2.00: configured for UDMA/133
[    6.378722] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    5.871651] ata5.00: ATA-8: WDC WD5000AAKS-00YGA0, 12.01C02, max UDMA/133
[    5.871651] ata5.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[    6.391641] ata5.00: configured for UDMA/133
[    6.942956] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    6.947471] ata6.00: ATA-8: WDC WD5000AAKS-00YGA0, 12.01C02, max UDMA/133
[    6.947471] ata6.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[    6.947471] ata6.00: configured for UDMA/133
[    6.947471] scsi 0:0:0:0: Direct-Access     ATA      WDC WD5000AAKS-0 12.0 PQ: 0 ANSI: 5
[    6.947471] scsi 1:0:0:0: Direct-Access     ATA      WDC WD5000AAKS-0 12.0 PQ: 0 ANSI: 5
[    6.947471] scsi 4:0:0:0: Direct-Access     ATA      WDC WD5000AAKS-0 12.0 PQ: 0 ANSI: 5
[    6.947471] scsi 5:0:0:0: Direct-Access     ATA      WDC WD5000AAKS-0 12.0 PQ: 0 ANSI: 5
[    7.190467] JMB: IDE controller (0x197b:0x2363 rev 0x03) at  PCI slot 0000:03:00.1
[    7.190467] ACPI: PCI Interrupt 0000:03:00.1[B] -> GSI 17 (level, low) -> IRQ 17
[    7.190467] JMB: 100% native mode on irq 17
[    7.190467]     ide0: BM-DMA at 0xd400-0xd407
[    7.190467]     ide1: BM-DMA at 0xd408-0xd40f
[    7.190467] Probing IDE interface ide0...
[    7.339303] hiddev96hidraw0: USB HID v1.10 Device [American Power Conversion Smart-UPS 1500 FW:601.3.D USB FW:8.1] on usb-0000:00:1a.1-1
[    7.351770] input: HID 0b38:0010 as /class/input/input2
[    7.371588] input,hidraw1: USB HID v1.10 Keyboard [HID 0b38:0010] on usb-0000:00:1d.2-2
[    7.392467] input: HID 0b38:0010 as /class/input/input3
[    7.408058] input,hidraw2: USB HID v1.10 Device [HID 0b38:0010] on usb-0000:00:1d.2-2
[    7.408082] usbcore: registered new interface driver usbhid
[    7.408085] usbhid: v2.6:USB HID core driver
[    7.967911] hda: PLEXTOR DVDR PX-760A, ATAPI CD/DVD-ROM drive
[    8.303604] hda: host max PIO5 wanted PIO255(auto-tune) selected PIO4
[    8.303673] hda: UDMA/66 mode selected
[    8.303744] Probing IDE interface ide1...
[    9.232484] ide0 at 0xdc00-0xdc07,0xd882 on irq 17
[    9.241583] ide1 at 0xd800-0xd807,0xd482 on irq 17
[    9.241583] ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[   10.241657] ahci 0000:03:00.0: AHCI 0001.0000 32 slots 2 ports 3 Gbps 0x3 impl SATA mode
[   10.241661] ahci 0000:03:00.0: flags: 64bit ncq pm led clo pmp pio slum part 
[   10.241668] PCI: Setting latency timer of device 0000:03:00.0 to 64
[   10.245557] scsi6 : ahci
[   10.245557] scsi7 : ahci
[   10.245557] ata7: SATA max UDMA/133 abar m8192@0xfeafe000 port 0xfeafe100 irq 16
[   10.245557] ata8: SATA max UDMA/133 abar m8192@0xfeafe000 port 0xfeafe180 irq 16
[   10.906187] ata7: SATA link down (SStatus 0 SControl 300)
[   11.292125] ata8: SATA link down (SStatus 0 SControl 300)
[   11.320150] ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 17 (level, low) -> IRQ 17
[   11.320150] PCI: Setting latency timer of device 0000:02:00.0 to 64
[   11.320150] atl1 0000:02:00.0: version 2.1.3
[   11.904008] Driver 'sd' needs updating - please use bus_type methods
[   11.904008] sd 0:0:0:0: [sda] 976771055 512-byte hardware sectors (500107 MB)
[   11.904009] sd 0:0:0:0: [sda] Write Protect is off
[   11.904009] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[   11.904009] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.904009] sd 0:0:0:0: [sda] 976771055 512-byte hardware sectors (500107 MB)
[   11.904009] sd 0:0:0:0: [sda] Write Protect is off
[   11.904009] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[   11.904009] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.904009]  sda:<6>ACPI: PCI Interrupt 0000:05:03.0[A] -> GSI 16 (level, low) -> IRQ 16
[   11.918139]  sda1 sda2
[   11.918251] sd 0:0:0:0: [sda] Attached SCSI disk
[   11.918316] sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
[   11.918329] sd 1:0:0:0: [sdb] Write Protect is off
[   11.918331] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[   11.918350] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.918391] sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
[   11.918402] sd 1:0:0:0: [sdb] Write Protect is off
[   11.918404] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[   11.918423] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.918425]  sdb: sdb1 sdb2
[   11.935044] sd 1:0:0:0: [sdb] Attached SCSI disk
[   11.935096] sd 4:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
[   11.935108] sd 4:0:0:0: [sdc] Write Protect is off
[   11.935110] sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[   11.935129] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.935164] sd 4:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
[   11.935175] sd 4:0:0:0: [sdc] Write Protect is off
[   11.935177] sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[   11.935196] sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.935198]  sdc: sdc1 sdc2
[   11.943259] sd 4:0:0:0: [sdc] Attached SCSI disk
[   11.943259] sd 5:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
[   11.943259] sd 5:0:0:0: [sdd] Write Protect is off
[   11.943259] sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[   11.943259] sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.943259] sd 5:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
[   11.943259] sd 5:0:0:0: [sdd] Write Protect is off
[   11.943259] sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[   11.943259] sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.943259]  sdd: sdd1 sdd2
[   11.960964] sd 5:0:0:0: [sdd] Attached SCSI disk
[   11.967109] ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[16]  MMIO=[febff800-febfffff]  Max Packet=[2048]  IR/IT contexts=[4/8]
[   11.994057] hda: ATAPI 40X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache
[   11.994057] Uniform CD-ROM driver Revision: 3.20
[   12.132930] md: raid1 personality registered for level 1
[   12.136816] xor: automatically using best checksumming function: generic_sse
[   12.157605]    generic_sse:  8587.000 MB/sec
[   12.157605] xor: using function: generic_sse (8587.000 MB/sec)
[   12.157605] async_tx: api initialized (async)
[   12.226046] raid6: int64x1   2255 MB/s
[   12.294046] raid6: int64x2   3042 MB/s
[   12.362086] raid6: int64x4   2669 MB/s
[   12.430086] raid6: int64x8   1725 MB/s
[   12.498087] raid6: sse2x1    3823 MB/s
[   12.566087] raid6: sse2x2    4355 MB/s
[   12.634088] raid6: sse2x4    7250 MB/s
[   12.634088] raid6: using algorithm sse2x4 (7250 MB/s)
[   12.634088] md: raid6 personality registered for level 6
[   12.634088] md: raid5 personality registered for level 5
[   12.634088] md: raid4 personality registered for level 4
[   12.638108] md: md0 stopped.
[   12.647060] md: bind<sdb1>
[   12.647060] md: bind<sda1>
[   12.658281] raid1: raid set md0 active with 2 out of 2 mirrors
[   12.658281] md: md1 stopped.
[   12.707292] md: bind<sdd1>
[   12.707129] md: bind<sdc1>
[   12.716865] raid1: raid set md1 active with 2 out of 2 mirrors
[   12.716865] md: md2 stopped.
[   12.778826] md: bind<sdb2>
[   12.779362] md: bind<sdc2>
[   12.781148] md: bind<sdd2>
[   12.781148] md: bind<sda2>
[   12.789087] raid5: device sda2 operational as raid disk 0
[   12.789089] raid5: device sdd2 operational as raid disk 3
[   12.789091] raid5: device sdc2 operational as raid disk 2
[   12.789092] raid5: device sdb2 operational as raid disk 1
[   12.791697] raid5: allocated 4274kB for md2
[   12.791697] raid5: raid level 5 set md2 active with 4 out of 4 devices, algorithm 2
[   12.791697] RAID5 conf printout:
[   12.791697]  --- rd:4 wd:4
[   12.791697]  disk 0, o:1, dev:sda2
[   12.791697]  disk 1, o:1, dev:sdb2
[   12.791697]  disk 2, o:1, dev:sdc2
[   12.791697]  disk 3, o:1, dev:sdd2
[   12.903039] device-mapper: uevent: version 1.0.3
[   12.903039] device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel@redhat.com
[   12.964067] PM: Starting manual resume from disk
[   13.055693] kjournald starting.  Commit interval 5 seconds
[   13.055693] EXT3-fs: mounted filesystem with ordered data mode.
[   13.286980] ieee1394: Host added: ID:BUS[0-00:1023]  GUID[0011d800018f90fd]
[   14.954097] udev: starting version 136
[   14.954097] udev: deprecated sysfs layout; update the kernel or disable CONFIG_SYSFS_DEPRECATED; some udev features will not work correctly
[   15.403737] input: Power Button (FF) as /class/input/input4
[   15.448661] ACPI: Power Button (FF) [PWRF]
[   15.448758] input: Power Button (CM) as /class/input/input5
[   15.515258] ACPI: Power Button (CM) [PWRB]
[   15.557826] input: PC Speaker as /class/input/input6
[   15.585410] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[   15.588707] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[   15.640968] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.03 (30-Apr-2008)
[   15.641047] iTCO_wdt: Found a ICH9 TCO device (Version=2, TCOBASE=0x0860)
[   15.641078] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[   15.676124] ACPI: PCI Interrupt 0000:00:1f.3[C] -> GSI 18 (level, low) -> IRQ 18
[   15.837745] Linux video capture interface: v2.00
[   15.918394] ivtv:  Start initialization, version 1.3.0
[   15.918394] ivtv0: Initializing card #0
[   15.918394] ivtv0: Autodetected Hauppauge card (cx23416 based)
[   15.918394] ACPI: PCI Interrupt 0000:06:08.0[A] -> GSI 18 (level, low) -> IRQ 18
[   15.970787] tveeprom 1-0050: Hauppauge model 23552, rev D492, serial# 9396298
[   15.970790] tveeprom 1-0050: tuner model is Philips FQ1236A MK4 (idx 92, type 57)
[   15.970792] tveeprom 1-0050: TV standards NTSC(M) (eeprom 0x08)
[   15.970794] tveeprom 1-0050: second tuner model is Philips TEA5768HL FM Radio (idx 101, type 62)
[   15.970796] tveeprom 1-0050: audio processor is CX25843 (idx 37)
[   15.970798] tveeprom 1-0050: decoder processor is CX25843 (idx 30)
[   15.970800] tveeprom 1-0050: has radio, has no IR receiver, has no IR transmitter
[   15.970802] ivtv0: Autodetected WinTV PVR 500 (unit #1)
[   16.025414] cx25840 1-0044: cx25843-23 found @ 0x88 (ivtv i2c driver #0)
[   16.075968] ACPI: PCI Interrupt 0000:00:1b.0[A] -> GSI 22 (level, low) -> IRQ 22
[   16.075968] PCI: Setting latency timer of device 0000:00:1b.0 to 64
[   16.085710] tuner 1-0060: chip found @ 0xc0 (ivtv i2c driver #0)
[   16.085710] tea5767 1-0060: type set to Philips TEA5767HN FM Radio
[   16.107789] tuner 1-0043: chip found @ 0x86 (ivtv i2c driver #0)
[   16.111830] hda_codec: Unknown model for ALC883, trying auto-probe from BIOS...
[   16.128506] tda9887 1-0043: creating new instance
[   16.128506] tda9887 1-0043: tda988[5/6/7] found
[   16.128506] tuner 1-0061: chip found @ 0xc2 (ivtv i2c driver #0)
[   16.128506] wm8775 1-001b: chip found @ 0x36 (ivtv i2c driver #0)
[   16.154237] tuner-simple 1-0061: creating new instance
[   16.154237] tuner-simple 1-0061: type set to 57 (Philips FQ1236A MK4)
[   16.162279] ivtv0: Registered device video0 for encoder MPG (4096 kB)
[   16.162279] ivtv0: Registered device video32 for encoder YUV (2048 kB)
[   16.162279] ivtv0: Registered device vbi0 for encoder VBI (1024 kB)
[   16.162279] ivtv0: Registered device video24 for encoder PCM (320 kB)
[   16.162279] ivtv0: Registered device radio0 for encoder radio
[   16.162279] ivtv0: Initialized card #0: WinTV PVR 500 (unit #1)
[   16.162279] ivtv1: Initializing card #1
[   16.162279] ivtv1: Autodetected Hauppauge card (cx23416 based)
[   16.162279] ACPI: PCI Interrupt 0000:06:09.0[A] -> GSI 19 (level, low) -> IRQ 19
[   16.214206] tveeprom 2-0050: Hauppauge model 23552, rev D492, serial# 9396298
[   16.214206] tveeprom 2-0050: tuner model is Philips FQ1236A MK4 (idx 92, type 57)
[   16.214206] tveeprom 2-0050: TV standards NTSC(M) (eeprom 0x08)
[   16.214206] tveeprom 2-0050: second tuner model is Philips TEA5768HL FM Radio (idx 101, type 62)
[   16.214206] tveeprom 2-0050: audio processor is CX25843 (idx 37)
[   16.214206] tveeprom 2-0050: decoder processor is CX25843 (idx 30)
[   16.214206] tveeprom 2-0050: has radio, has no IR receiver, has no IR transmitter
[   16.214206] ivtv1: Correcting tveeprom data: no radio present on second unit
[   16.214206] ivtv1: Autodetected WinTV PVR 500 (unit #2)
[   16.244731] cx25840 2-0044: cx25843-23 found @ 0x88 (ivtv i2c driver #1)
[   16.254001] tuner 2-0043: chip found @ 0x86 (ivtv i2c driver #1)
[   16.254018] tda9887 2-0043: creating new instance
[   16.254020] tda9887 2-0043: tda988[5/6/7] found
[   16.257949] tuner 2-0061: chip found @ 0xc2 (ivtv i2c driver #1)
[   16.257969] wm8775 2-001b: chip found @ 0x36 (ivtv i2c driver #1)
[   16.266692] tuner-simple 2-0061: creating new instance
[   16.266694] tuner-simple 2-0061: type set to 57 (Philips FQ1236A MK4)
[   16.276481] ivtv1: Registered device video1 for encoder MPG (4096 kB)
[   16.276508] ivtv1: Registered device video33 for encoder YUV (2048 kB)
[   16.276525] ivtv1: Registered device vbi1 for encoder VBI (1024 kB)
[   16.276542] ivtv1: Registered device video25 for encoder PCM (320 kB)
[   16.276544] ivtv1: Initialized card #1: WinTV PVR 500 (unit #2)
[   16.276570] ivtv:  End initialization
[  297.202646] EXT3 FS on md0, internal journal
[  297.938039] loop: module loaded
[  298.001375] w83627ehf: Found W83627DHG chip at 0x290
[  298.020007] coretemp coretemp.0: Using relative temperature scale!
[  298.020045] coretemp coretemp.1: Using relative temperature scale!
[  298.020079] coretemp coretemp.2: Using relative temperature scale!
[  298.020110] coretemp coretemp.3: Using relative temperature scale!
[ 4599.205244] fuse init (API version 7.9)
[ 4599.623123] kjournald starting.  Commit interval 5 seconds
[ 4599.652504] EXT3 FS on md1, internal journal
[ 4599.652504] EXT3-fs: mounted filesystem with ordered data mode.
[ 4599.749170] kjournald starting.  Commit interval 5 seconds
[ 4599.782877] EXT3 FS on dm-1, internal journal
[ 4599.782877] EXT3-fs: mounted filesystem with ordered data mode.
[ 4600.001137] kjournald2 starting.  Commit interval 5 seconds
[ 4600.029133] EXT4 FS on dm-2, internal journal
[ 4600.029133] EXT4-fs: mounted filesystem with ordered data mode.
[ 4600.029133] EXT4-fs: file extents enabled
[ 4600.033741] EXT4-fs: mballoc enabled
[ 4600.152047] Adding 6291448k swap on /dev/mapper/MainVG-Swap.  Priority:-1 extents:1 across:6291448k
[ 4611.159518] atl1 0000:02:00.0: eth0 link is up 100 Mbps full duplex
[ 4611.159518] atl1 0000:02:00.0: eth0 link is up 1000 Mbps full duplex
[ 4612.636342] NET: Registered protocol family 10
[ 4612.636342] lo: Disabled Privacy Extensions
[ 4613.573786] RPC: Registered udp transport module.
[ 4613.573786] RPC: Registered tcp transport module.
[ 4613.757806] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[ 4614.522350] ttyS0: LSR safety check engaged!
[ 4614.532858] ttyS0: LSR safety check engaged!

So anything else I can provide?

-- 
Len Sorensen


* Re: Linux 2.6.29
  2009-04-01 22:57                               ` Lennart Sorensen
@ 2009-04-03 14:46                                 ` Mark Lord
  2009-04-03 15:16                                   ` Lennart Sorensen
  0 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-04-03 14:46 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Andrew Morton, torvalds, tytso, drees76, jesper, linux-kernel

Lennart Sorensen wrote:
> On Wed, Apr 01, 2009 at 02:36:22PM -0700, Andrew Morton wrote:
>> Back in 2002ish I did a *lot* of work on IO latency, reads-vs-writes,
>> etc, etc (but not fsync - for practical purposes it's unfixable on
>> ext3-ordered)
>>
>> Performance was pretty good.  From some of the descriptions I'm seeing
>> get tossed around lately, I suspect that it has regressed.
>>
>> It would be useful/interesting if people were to rerun some of these
>> tests with `echo anticipatory > /sys/block/sda/queue/scheduler'.
>>
>> Or with linux-2.5.60 :(
> 
> Well 2.6.18 seems to keep popping up as the last kernel with "sane"
> behaviour, at least in terms of not causing huge delays under many
> workloads.  I currently run 2.6.26, although that could be updated as
> soon as I get around to figuring out why lirc isn't working for me when
> I move past 2.6.26.
> 
> I could certainly try changing the scheduler on my mythtv box and seeing
> if that makes any difference to the behaviour.  It is pretty darn obvious
> whether it is responsive or not when starting to play back a video.
..

My Myth box here was running 2.6.18 when originally set up,
and even back then it still took *minutes* to delete large files.
So that part hasn't really changed much in the interim.

Because of the multi-minute deletes, the distro shutdown scripts
would fail, and power off the box while it was still writing
to the drives.  Ouch.

That system has had XFS on it for the past year and a half now,
and for Myth, there's no reason not to use XFS.  It's great!

Cheers



* Re: Linux 2.6.29
  2009-04-03 14:21                                       ` Lennart Sorensen
@ 2009-04-03 15:05                                         ` Mark Lord
  2009-04-03 15:14                                           ` Lennart Sorensen
  2009-04-03 19:57                                           ` Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: Mark Lord @ 2009-04-03 15:05 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Jens Axboe, Linus Torvalds, Ingo Molnar, Andrew Morton, tytso,
	drees76, jesper, Linux Kernel Mailing List

Lennart Sorensen wrote:
>
> Well the system is set up like this:
> 
> Core 2 Quad Q6600 CPU (2.4GHz quad core).
> Asus P5K mainboard (Intel P35 chipset)
> 6GB of ram
> PVR500 dual NTSC tuner pci card
..
> So the behaviour with cfq is:
> Disk light seems to be constantly on if there is any disk activity.  iotop
> can show a total io of maybe 1MB/s and the disk light is on constantly.
..

Lennart,

I wonder if the problem with your system is really a Myth/driver issue?

Curiously, I have a HVR-1600 card here, and when recording analog TV with
it the disk lights are on constantly.  The problem with it turns out to
be mythbackend doing fsync() calls ten times a second.

My other tuner cards don't have this problem.

So perhaps the PVR-500 triggers the same buggy behaviour as the HVR-1600?
To work around it here, I decided to use a preload library that replaces
the frequent fsync() calls with a more moderated behaviour:

   http://rtr.ca/hvr1600/libfsync.tar.gz

Grab that file and try it out.  Instructions are included within.
Report back again and let us know if it makes any difference.

Someday I may try and chase down the exact bug that causes mythbackend
to go fsyncing berserk like that, but for now this workaround is fine.
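
For readers who want the general idea of such a preload workaround, here is a
minimal sketch. It is NOT the contents of libfsync.tar.gz; the one-second
interval, the helper names (sync_is_due, now_ms) and the skip-and-return-0
policy are all assumptions made for illustration. It interposes fsync() and
lets at most one real sync through per interval, process-wide.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Assumed policy: at most one real fsync per interval, process-wide. */
#define MIN_SYNC_INTERVAL_MS 1000L

/* Not thread-safe; a real shim for a threaded program needs locking. */
static long last_sync_ms = -1;   /* -1 means no sync has been issued yet */
static long skipped_syncs = 0;   /* how many calls were swallowed */

/* Monotonic clock in milliseconds. */
static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long)ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Hypothetical helper: is a real sync due, given the time of the last one? */
static int sync_is_due(long now, long last, long interval)
{
    return last < 0 || now - last >= interval;
}

/*
 * Same name and signature as the libc fsync(), so the dynamic linker
 * resolves the application's calls here when the library is preloaded.
 */
int fsync(int fd)
{
    long now = now_ms();

    if (!sync_is_due(now, last_sync_ms, MIN_SYNC_INTERVAL_MS)) {
        skipped_syncs++;
        return 0;   /* report success; a later call does the real sync */
    }
    last_sync_ms = now;
    /* Invoke the system call directly to bypass our own wrapper. */
    return (int)syscall(SYS_fsync, fd);
}
```

Built with something like "gcc -shared -fPIC -o libthrottle.so throttle.c" and
loaded via LD_PRELOAD=./libthrottle.so in front of mythbackend (the file names
are made up). Swallowing fsync() weakens durability guarantees, so this only
makes sense for data you can afford to lose, such as an ongoing recording.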

Cheers


* Re: Linux 2.6.29
  2009-04-03 11:32                                                   ` Chris Mason
@ 2009-04-03 15:07                                                     ` Linus Torvalds
  2009-04-03 15:40                                                       ` Chris Mason
  2009-04-03 20:05                                                       ` Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-03 15:07 UTC (permalink / raw)
  To: Chris Mason
  Cc: Jeff Garzik, Andrew Morton, David Rees, Linux Kernel Mailing List



On Fri, 3 Apr 2009, Chris Mason wrote:

> On Thu, 2009-04-02 at 20:34 -0700, Linus Torvalds wrote:
> > 
> > Well, one rather simple explanation is that if you hadn't been doing lots 
> > of writes, then the background garbage collection on the Intel SSD gets 
> > ahead of the game, and gives you lots of bursty nice write bandwidth due 
> > to having nicely compacted and pre-erased blocks.
> > 
> > Then, after lots of writing, all the pre-erased blocks are gone, and you 
> > are down to a steady state where it needs to GC and erase blocks to make 
> > room for new writes.
> > 
> > So that part doesn't surprise me per se. The Intel SSD's definitely 
> > fluctuate a bit timing-wise (but I love how they never degenerate to the 
> > "ooh, that _really_ sucks" case that the other SSD's and the rotational 
> > media I've seen do when you do random writes).
> > 
> 
> 23MB/s seems a bit low though, I'd try with O_DIRECT.  ext3 doesn't do
> writepages, and the ssd may be very sensitive to smaller writes (what
> brand?)

I didn't realize that Jeff had a non-Intel SSD. 

THAT sure explains the huge drop-off. I do see Intel SSD's fluctuating 
too, but the Intel ones tend to be _fairly_ stable.

> > The fact that it also happens for the regular disk does imply that it's 
> > not the _only_ thing going on, though.
> 
> Jeff if you blktrace it I can make up a seekwatcher graph.  My bet is
> that pdflush is stuck writing the indirect blocks, and doing a ton of
> seeks.
> 
> You could change the overwrite program to also do sync_file_range on the
> block device ;)

Actually, that won't help. 'sync_file_range()' works only on the virtually 
indexed page cache, and I think ext3 uses "struct buffer_head *" for all 
its metadata updates (due to how JBD works). So sync_file_range() will do 
nothing at all to the metadata, regardless of what mapping you execute it 
on.

			Linus


* Re: Linux 2.6.29
  2009-04-02 11:05                             ` Janne Grunau
                                                 ` (2 preceding siblings ...)
  2009-04-02 21:50                               ` Lennart Sorensen
@ 2009-04-03 15:07                               ` Mark Lord
  3 siblings, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-04-03 15:07 UTC (permalink / raw)
  To: Janne Grunau
  Cc: Lennart Sorensen, Andrew Morton, Linus Torvalds, Theodore Tso,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

Janne Grunau wrote:
>..
> MythTV calls fsync every few seconds on ongoing recordings to prevent
> stalls due to large cache writebacks on ext3.
>
> Janne
> (MythTV developer)
..

Oooh.. a myth dev!

With the HVR-1600 card, myth calls fsync() *ten* times a second
while recording analog TV (digital is fine).

Any chance you could track down and fix that ?

It might be the same thing that's biting Lennart's system
with his PVR-500 card.

Cheers


* Re: Linux 2.6.29
  2009-04-02 17:04                                     ` Linus Torvalds
  2009-04-02 22:09                                       ` Jeff Garzik
@ 2009-04-03 15:14                                       ` Mark Lord
  2009-04-03 15:18                                         ` Lennart Sorensen
  2009-04-03 15:28                                         ` Linus Torvalds
  1 sibling, 2 replies; 664+ messages in thread
From: Mark Lord @ 2009-04-03 15:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, David Rees, Janne Grunau, Lennart Sorensen,
	Theodore Tso, Jesper Krogh, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Thu, 2 Apr 2009, Linus Torvalds wrote:
>> On Thu, 2 Apr 2009, Andrew Morton wrote:
>>> A suitable design for the streaming might be, every 4MB:
>>>
>>> - run sync_file_range(SYNC_FILE_RANGE_WRITE) to get the 4MB underway
>>>   to the disk
>>>
>>> - run fadvise(POSIX_FADV_DONTNEED) against the previous 4MB to
>>>   discard it from pagecache.
>> Here's an example. I call it "overwrite.c" for obvious reasons.
> 
> Oh, except my example doesn't do the fadvise. Instead, I make sure to 
> throttle the writes and the old range with
> 
> 	SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
> 
> which makes sure that the old pages are easily dropped by the VM - and 
> they will be, since they end up always being on the cold list.
> 
> I _wanted_ to add a SYNC_FILE_RANGE_DROP but I never bothered because this 
> particular load it didn't matter. The system was perfectly usable while 
> overwriting even huge disks because there was never more than 8MB of dirty 
> data in flight in the IO queues at any time.
..

Note that for mythtv, this may not be the best behaviour.

A common use scenario is "watching live TV", a few minutes behind
real-time so that the commercial-skipping can work its magic.

In that scenario, those pages are going to be needed again
within a short while, and it might be useful to keep them around.

But then Myth itself could probably decide whether to discard them
or not, based upon that kind of knowledge.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:05                                         ` Mark Lord
@ 2009-04-03 15:14                                           ` Lennart Sorensen
  2009-04-03 19:57                                           ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-03 15:14 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jens Axboe, Linus Torvalds, Ingo Molnar, Andrew Morton, tytso,
	drees76, jesper, Linux Kernel Mailing List

On Fri, Apr 03, 2009 at 11:05:04AM -0400, Mark Lord wrote:
> I wonder if the problem with your system is really a Myth/driver issue?

Could be.  That is the point of the box after all.

> Curiously, I have a HVR-1600 card here, and when recording analog TV with
> it the disk lights are on constantly.  The problem with it turns out to
> be mythbackend doing fsync() calls ten times a second.

But why would anticipatory help that?

> My other tuner cards don't have this problem.
>
> So perhaps the PVR-500 triggers the same buggy behaviour as the HVR-1600?
> To work around it here, I decided to use a preload library that replaces
> the frequent fsync() calls with a more moderated behaviour:
>
>   http://rtr.ca/hvr1600/libfsync.tar.gz
>
> Grab that file and try it out.  Instructions are included within.
> Report back again and let us know if it makes any difference.

I can have a try at that.  I will see how cfq behaves with that installed.

> Someday I may try and chase down the exact bug that causes mythbackend
> to go fsyncing berserk like that, but for now this workaround is fine.

Well if it is the real cause of the bad behaviour then it would certainly
be good to track down.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 14:46                                 ` Mark Lord
@ 2009-04-03 15:16                                   ` Lennart Sorensen
  2009-04-03 15:42                                     ` Mark Lord
  2009-04-03 18:59                                     ` Jeff Garzik
  0 siblings, 2 replies; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-03 15:16 UTC (permalink / raw)
  To: Mark Lord; +Cc: Andrew Morton, torvalds, tytso, drees76, jesper, linux-kernel

On Fri, Apr 03, 2009 at 10:46:34AM -0400, Mark Lord wrote:
> My Myth box here was running 2.6.18 when originally set up,
> and even back then it still took *minutes* to delete large files.
> So that part hasn't really changed much in the interim.
>
> Because of the multi-minute deletes, the distro shutdown scripts
> would fails, and power off the box while it was still writing
> to the drives.  Ouch.
>
> That system has had XFS on it for the past year and a half now,
> and for Myth, there's no reason not to use XFS.  It's great!

Mythtv has a 'slow delete' option that I believe works by slowly
truncating the file.  Seems they believe that ext3 is bad at handling
large file deletes, so they try to spread out the pain.  I don't remember
if that option is on by default or not.  I turned it off.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:14                                       ` Mark Lord
@ 2009-04-03 15:18                                         ` Lennart Sorensen
  2009-04-03 15:46                                           ` Mark Lord
  2009-04-03 15:28                                         ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-03 15:18 UTC (permalink / raw)
  To: Mark Lord
  Cc: Linus Torvalds, Andrew Morton, David Rees, Janne Grunau,
	Theodore Tso, Jesper Krogh, Linux Kernel Mailing List

On Fri, Apr 03, 2009 at 11:14:39AM -0400, Mark Lord wrote:
> Note that for mythtv, this may not be the best behaviour.
>
> A common use scenario is "watching live TV", a few minutes behind
> real-time so that the commercial-skipping can work its magic.

Well I really never watch live TV.  I watch shows when I want to, not
when they happen to be on the air.  So I certainly couldn't care less
if they were no longer cached.

> In that scenario, those pages are going to be needed again
> within a short while, and it might be useful to keep them around.

Within 1 minute might be a lot of data for an MPEG2 stream.

> But then Myth itself could probably decide whether to discard them
> or not, not based upon that kind of knowledge.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:14                                       ` Mark Lord
  2009-04-03 15:18                                         ` Lennart Sorensen
@ 2009-04-03 15:28                                         ` Linus Torvalds
  1 sibling, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-03 15:28 UTC (permalink / raw)
  To: Mark Lord
  Cc: Andrew Morton, David Rees, Janne Grunau, Lennart Sorensen,
	Theodore Tso, Jesper Krogh, Linux Kernel Mailing List



On Fri, 3 Apr 2009, Mark Lord wrote:
> 
> Note that for mythtv, this may not be the best behaviour.
> 
> A common use scenario is "watching live TV", a few minutes behind
> real-time so that the commercial-skipping can work its magic.
> 
> In that scenario, those pages are going to be needed again
> within a short while, and it might be useful to keep them around.
> 
> But then Myth itself could probably decide whether to discard them
> or not, not based upon that kind of knowledge.

Yes. I suspect that Myth could do heuristics like "when watching live TV, 
do drop-behind about 30s after the currently showing stream". That still 
allows for replay, but older stuff you've watched really likely isn't all 
that interesting and might be worth dropping in order to make room for 
more data.

And you can use posix_fadvise() for that, since it's now no longer 
connected with "wait for background IO to complete" at all.

The reason for wanting "SYNC_FILE_RANGE_DROP" was simply that I was doing 
the "wait after write" anyway, and thinking I wanted to get rid of the 
pages while I was already handling them. But that was for an app where I 
_knew_ the data was uninteresting as soon as it was on disk. Doing a secure 
delete is different from recording video ;)
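
A minimal sketch of that drop-behind heuristic, for illustration only. The
helper names, the bytes-per-second estimate and the 4 KiB page rounding are
assumptions made here, not anything taken from MythTV; the only real
interface used is posix_fadvise() with POSIX_FADV_DONTNEED.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: given the current playback offset and an estimated
 * stream rate, return the file offset below which cached pages are old
 * enough to drop. The result is rounded down to a 4 KiB boundary (an
 * assumed page size) so that only whole pages are covered.
 */
off_t drop_behind_cutoff(off_t play_pos, off_t bytes_per_sec, int keep_sec)
{
    assert(bytes_per_sec >= 0 && keep_sec >= 0);
    off_t keep = bytes_per_sec * (off_t)keep_sec;
    off_t cutoff = play_pos > keep ? play_pos - keep : 0;
    return cutoff & ~(off_t)4095;
}

/*
 * Ask the kernel to drop cached pages for everything older than keep_sec
 * seconds behind the playback position. Returns 0 on success or an errno
 * value, following the posix_fadvise() convention.
 */
int drop_behind(int fd, off_t play_pos, off_t bytes_per_sec, int keep_sec)
{
    off_t cutoff = drop_behind_cutoff(play_pos, bytes_per_sec, keep_sec);
    if (cutoff == 0)
        return 0;               /* nothing old enough to drop yet */
    return posix_fadvise(fd, 0, cutoff, POSIX_FADV_DONTNEED);
}
```

A player could call drop_behind(fd, pos, rate, 30) every few seconds while
showing live TV. Pages that are still dirty are generally not dropped by
DONTNEED, so this works best on data that has already been written back.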

				Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:07                                                     ` Linus Torvalds
@ 2009-04-03 15:40                                                       ` Chris Mason
  2009-04-03 20:05                                                       ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Chris Mason @ 2009-04-03 15:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Andrew Morton, David Rees, Linux Kernel Mailing List

On Fri, 2009-04-03 at 08:07 -0700, Linus Torvalds wrote:
> 
> On Fri, 3 Apr 2009, Chris Mason wrote:
> 
> > On Thu, 2009-04-02 at 20:34 -0700, Linus Torvalds wrote:
> > > 
> > > Well, one rather simple explanation is that if you hadn't been doing lots 
> > > of writes, then the background garbage collection on the Intel SSD gets 
> > > ahead of the game, and gives you lots of bursty nice write bandwidth due 
> > > to having a nicely compacted and pre-erased blocks.
> > > 
> > > Then, after lots of writing, all the pre-erased blocks are gone, and you 
> > > are down to a steady state where it needs to GC and erase blocks to make 
> > > room for new writes.
> > > 
> > > So that part doesn't suprise me per se. The Intel SSD's definitely 
> > > flucutate a bit timing-wise (but I love how they never degenerate to the 
> > > "ooh, that _really_ sucks" case that the other SSD's and the rotational 
> > > media I've seen does when you do random writes).
> > > 
> > 
> > 23MB/s seems a bit low though, I'd try with O_DIRECT.  ext3 doesn't do
> > writepages, and the ssd may be very sensitive to smaller writes (what
> > brand?)
> 
> I didn't realize that Jeff had a non-Intel SSD. 
> 
> THAT sure explains the huge drop-off. I do see Intel SSD's fluctuating 
> too, but the Intel ones tend to be _fairly_ stable.

Even the intel ones have cliffs for long running random io workloads
(where the bottom of the cliff is still very fast), but something like
this should be stable.

> 
> > > The fact that it also happens for the regular disk does imply that it's 
> > > not the _only_ thing going on, though.
> > 
> > Jeff if you blktrace it I can make up a seekwatcher graph.  My bet is
> > that pdflush is stuck writing the indirect blocks, and doing a ton of
> > seeks.
> > 
> > You could change the overwrite program to also do sync_file_range on the
> > block device ;)
> 
> Actually, that won't help. 'sync_file_range()' works only on the virtually 
> indexed page cache, and I think ext3 uses "struct buffer_head *" for all 
> it's metadata updates (due to how JBD works). So sync_file_range() will do 
> nothing at all to the metadata, regardless of what mapping you execute it 
> on.

The buffer heads do end up on the block device inode's pages, and ext3
is letting pdflush do some of the writeback.  It's hard to say if the
sync_file_range is going to help, the IO on the metadata may be random
enough for that ssd that it won't really matter who writes it or when.

Spinning disks might suck, but at least they all suck in the same
way...tuning for all these different ssds isn't going to be fun at all.

-chris



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:16                                   ` Lennart Sorensen
@ 2009-04-03 15:42                                     ` Mark Lord
  2009-04-03 18:59                                     ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-04-03 15:42 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Andrew Morton, torvalds, tytso, drees76, jesper, linux-kernel

Lennart Sorensen wrote:
> On Fri, Apr 03, 2009 at 10:46:34AM -0400, Mark Lord wrote:
>> My Myth box here was running 2.6.18 when originally set up,
>> and even back then it still took *minutes* to delete large files.
>> So that part hasn't really changed much in the interim.
>>
>> Because of the multi-minute deletes, the distro shutdown scripts
>> would fails, and power off the box while it was still writing
>> to the drives.  Ouch.
>>
>> That system has had XFS on it for the past year and a half now,
>> and for Myth, there's no reason not to use XFS.  It's great!
> 
> Mythtv has a 'slow delete' option that I believe works by slowly
> truncating the file.  Seems they believe that ext3 is bad at handling
> large file deletes, so they try to spread out the pain.  I don't remember
> if that option is on by default or not.  I turned it off.
..


That option doesn't make much difference for the shutdown failure.
And with XFS there's no need for it, so I now have it "off".

Cheers


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:18                                         ` Lennart Sorensen
@ 2009-04-03 15:46                                           ` Mark Lord
  0 siblings, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-04-03 15:46 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Linus Torvalds, Andrew Morton, David Rees, Janne Grunau,
	Theodore Tso, Jesper Krogh, Linux Kernel Mailing List

Lennart Sorensen wrote:
> On Fri, Apr 03, 2009 at 11:14:39AM -0400, Mark Lord wrote:
>> Note that for mythtv, this may not be the best behaviour.
>>
>> A common use scenario is "watching live TV", a few minutes behind
>> real-time so that the commercial-skipping can work its magic.
> 
> Well I really never watch live TV.
..

A *true* myth dev!  (pretenders use LiveTV, *real* devs don't!)

But mythcommflag also benefits from having the pages
hang around for an extra short time.

Cheers

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:16                                   ` Lennart Sorensen
  2009-04-03 15:42                                     ` Mark Lord
@ 2009-04-03 18:59                                     ` Jeff Garzik
  2009-04-04  8:18                                       ` Andrew Morton
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03 18:59 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Mark Lord, Andrew Morton, torvalds, tytso, drees76, jesper, linux-kernel

Lennart Sorensen wrote:
> On Fri, Apr 03, 2009 at 10:46:34AM -0400, Mark Lord wrote:
>> My Myth box here was running 2.6.18 when originally set up,
>> and even back then it still took *minutes* to delete large files.
>> So that part hasn't really changed much in the interim.
>>
>> Because of the multi-minute deletes, the distro shutdown scripts
>> would fails, and power off the box while it was still writing
>> to the drives.  Ouch.
>>
>> That system has had XFS on it for the past year and a half now,
>> and for Myth, there's no reason not to use XFS.  It's great!
> 
> Mythtv has a 'slow delete' option that I believe works by slowly
> truncating the file.  Seems they believe that ext3 is bad at handling
> large file deletes, so they try to spread out the pain.  I don't remember
> if that option is on by default or not.  I turned it off.

It's pretty painful for super-large files with lots of metadata.

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:05                                         ` Mark Lord
  2009-04-03 15:14                                           ` Lennart Sorensen
@ 2009-04-03 19:57                                           ` Jeff Garzik
  2009-04-03 21:28                                             ` Janne Grunau
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03 19:57 UTC (permalink / raw)
  To: Mark Lord
  Cc: Lennart Sorensen, Jens Axboe, Linus Torvalds, Ingo Molnar,
	Andrew Morton, tytso, drees76, jesper, Linux Kernel Mailing List

Mark Lord wrote:
> Lennart Sorensen wrote:
>>
>> Well the system is setup like this:
>>
>> Core 2 Quad Q6600 CPU (2.4GHz quad core).
>> Asus P5K mainboard (Intel P35 chipset)
>> 6GB of ram
>> PVR500 dual NTSC tuner pci card
> ..
>> So the behaviour with cfq is:
>> Disk light seems to be constantly on if there is any disk activity.  
>> iotop
>> can show a total io of maybe 1MB/s and the disk light is on constantly.
> ..
> 
> Lennart,
> 
> I wonder if the problem with your system is really a Myth/driver issue?
> 
> Curiously, I have a HVR-1600 card here, and when recording analog TV with
> it the disk lights are on constantly.  The problem with it turns out to
> be mythbackend doing fsync() calls ten times a second.
> 
> My other tuner cards don't have this problem.
> 
> So perhaps the PVR-500 triggers the same buggy behaviour as the HVR-1600?
> To work around it here, I decided to use a preload library that replaces
> the frequent fsync() calls with a more moderated behaviour:
> 
>   http://rtr.ca/hvr1600/libfsync.tar.gz
> 
> Grab that file and try it out.  Instructions are included within.
> Report back again and let us know if it makes any difference.
> 
> Someday I may try and chase down the exact bug that causes mythbackend
> to go fsyncing berserk like that, but for now this workaround is fine.

mythtv/libs/libmythtv/ThreadedFileWriter.cpp is a good place to start 
(Sync method... uses fdatasync if available, fsync if not).

mythtv is definitely a candidate for sync_file_range() style output, IMO.

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 15:07                                                     ` Linus Torvalds
  2009-04-03 15:40                                                       ` Chris Mason
@ 2009-04-03 20:05                                                       ` Jeff Garzik
  2009-04-03 20:14                                                         ` Linus Torvalds
  2009-04-04 12:44                                                         ` Mark Lord
  1 sibling, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03 20:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Andrew Morton, David Rees, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 342 bytes --]

Linus Torvalds wrote:
> I didn't realize that Jeff had a non-Intel SSD. 
> THAT sure explains the huge drop-off. I do see Intel SSD's fluctuating 
> too, but the Intel ones tend to be _fairly_ stable.

Yeah, it's a no-name SSD.

I've attached 'hdparm -I' in case anyone is curious.  It's from 
newegg.com, so nothing NDA'd or sekrit.

	Jeff


[-- Attachment #2: ssd-hdparm.txt --]
[-- Type: text/plain, Size: 1335 bytes --]


/dev/sdb:

ATA device, with non-removable media
	Model Number:       G.SKILL 128GB SSD                       
	Serial Number:      MK0108480A545003B   
	Firmware Revision:  02.10104
Standards:
	Used: ATA/ATAPI-7 T13 1532D revision 4a 
	Supported: 8 7 6 5 & some of 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  250445824
	device size with M = 1024*1024:      122288 MBytes
	device size with M = 1000*1000:      128228 MBytes (128 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 1	Current = ?
	Recommended acoustic management value: 128, current value: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	   *	Power Management feature set
	    	Write cache
	    	Look-ahead
	   *	Mandatory FLUSH_CACHE
	   *	SATA-I signaling speed (1.5Gb/s)
	   *	SATA-II signaling speed (3.0Gb/s)
	   *	Host-initiated interface power management
	   *	Phy event counters
Checksum: correct

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 20:05                                                       ` Jeff Garzik
@ 2009-04-03 20:14                                                         ` Linus Torvalds
  2009-04-03 21:48                                                           ` Jeff Garzik
  2009-04-03 23:35                                                           ` Dave Jones
  2009-04-04 12:44                                                         ` Mark Lord
  1 sibling, 2 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-03 20:14 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Chris Mason, Andrew Morton, David Rees, Linux Kernel Mailing List



On Fri, 3 Apr 2009, Jeff Garzik wrote:
>
> Yeah, it's a no-name SSD.
> 
> I've attached 'hdparm -I' in case anyone is curious.  It's from newegg.com, so
> nothing NDA'd or sekrit.

Hmm. Does it do ok on the "random write" test? There's a few non-intel 
controllers that are fine - apparently the newer samsung ones, and the one 
from Indilinx.

But I _think_ G.SKILL uses those horribly broken JMicron controllers. 
Judging by your performance numbers, it's the slightly fancier double 
controller version (ie basically an internal RAID0 of two identical 
JMicron controllers, each handling half of the flash chips).

Try a random write test. If it's the JMicron controllers, performance will 
plummet to a few tens of kilobytes per second.

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 19:57                                           ` Jeff Garzik
@ 2009-04-03 21:28                                             ` Janne Grunau
  2009-04-03 21:57                                               ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Janne Grunau @ 2009-04-03 21:28 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mark Lord, Lennart Sorensen, Jens Axboe, Linus Torvalds,
	Ingo Molnar, Andrew Morton, tytso, drees76, jesper,
	Linux Kernel Mailing List

On Fri, Apr 03, 2009 at 03:57:52PM -0400, Jeff Garzik wrote:
> Mark Lord wrote:
> > 
> > I wonder if the problem with your system is really a Myth/driver issue?
> > 
> > Curiously, I have a HVR-1600 card here, and when recording analog TV with
> > it the disk lights are on constantly.  The problem with it turns out to
> > be mythbackend doing fsync() calls ten times a second.
> > 
> > My other tuner cards don't have this problem.
> > 
> > So perhaps the PVR-500 triggers the same buggy behaviour as the HVR-1600?
> > To work around it here, I decided to use a preload library that replaces
> > the frequent fsync() calls with a more moderated behaviour:
> > 
> >   http://rtr.ca/hvr1600/libfsync.tar.gz
> > 
> > Grab that file and try it out.  Instructions are included within.
> > Report back again and let us know if it makes any difference.
> > 
> > Someday I may try and chase down the exact bug that causes mythbackend
> > to go fsyncing berserk like that, but for now this workaround is fine.

That sounds as if it indeed syncs every 100ms instead of once per second
over the whole recording. It's intended behaviour for the first 64K.

> mythtv/libs/libmythtv/ThreadedFileWriter.cpp is a good place to start 
> (Sync method... uses fdatasync if available, fsync if not).
> 
> mythtv is definitely a candidate for sync_file_range() style output, IMO.

yeah, I'm on it.

Janne

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 20:14                                                         ` Linus Torvalds
@ 2009-04-03 21:48                                                           ` Jeff Garzik
  2009-04-03 22:06                                                             ` Linus Torvalds
  2009-04-03 23:35                                                           ` Dave Jones
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03 21:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Andrew Morton, David Rees, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Fri, 3 Apr 2009, Jeff Garzik wrote:
>> Yeah, it's a no-name SSD.
>>
>> I've attached 'hdparm -I' in case anyone is curious.  It's from newegg.com, so
>> nothing NDA'd or sekrit.
> 
> Hmm. Does it do ok on the "random write" test? There's a few non-intel 
> controllers that are fine - apparently the newer samsung ones, and the one 
> from Indilinx.
> 
> But I _think_ G.SKILL uses those horribly broken JMicron controllers. 
> Judging by your performance numbers, it's the slightly fancier double 
> controller version (ie basically an internal RAID0 of two identical 
> JMicron controllers, each handling half of the flash chips).

Quoting from the review at
http://www.bit-tech.net/hardware/storage/2008/12/03/g-skill-patriot-and-intel-ssd-test/2

"Cracking the drive open reveals the PCB fitted with sixteen Samsung 
840, 8GB MLC NAND flash memory modules, linked to a J-Micron JMF 602 
storage controller chip."


> Try a random write test. If it's the JMicron controllers, performance will 
> plummet to a few tens of kilobytes per second.

Since I am hacking on osdblk currently, I was too slack to code up a 
test.  This is what bonnie++ says, at least...

> Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> bd.yyz.us     8000M           28678   6 27836   5           133246  12  5237  10
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
> bd.yyz.us,8000M,,,28678,6,27836,5,,,133246,12,5236.6,10,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

But I guess seeks are not very helpful on an SSD :)  Any pre-built 
random write tests out there?

Regards,

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 21:28                                             ` Janne Grunau
@ 2009-04-03 21:57                                               ` Jeff Garzik
  2009-04-03 22:32                                                 ` Janne Grunau
  2009-04-03 22:53                                                 ` David Rees
  0 siblings, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03 21:57 UTC (permalink / raw)
  To: Janne Grunau
  Cc: Mark Lord, Lennart Sorensen, Jens Axboe, Linus Torvalds,
	Ingo Molnar, Andrew Morton, tytso, drees76, jesper,
	Linux Kernel Mailing List

Janne Grunau wrote:
> On Fri, Apr 03, 2009 at 03:57:52PM -0400, Jeff Garzik wrote:
>> Mark Lord wrote:
>>> Grab that file and try it out.  Instructions are included within.
>>> Report back again and let us know if it makes any difference.
>>>
>>> Someday I may try and chase down the exact bug that causes mythbackend
>>> to go fsyncing berserk like that, but for now this workaround is fine.
> 
> that sounds if it indeed syncs every 100ms instead of once per second
> over the whole recording. It's inteneded behaviour for the first 64K.
> 
>> mythtv/libs/libmythtv/ThreadedFileWriter.cpp is a good place to start 
>> (Sync method... uses fdatasync if available, fsync if not).
>>
>> mythtv is definitely a candidate for sync_file_range() style output, IMO.

> yeah, I'm on it.

Just curious, does MythTV need fsync(), or merely to tell the kernel to 
begin asynchronously writing data to storage?

sync_file_range(..., SYNC_FILE_RANGE_WRITE) might be enough, if you do 
not need to actually wait for completion.

This may be the case, if the idea behind MythTV's fsync(2) is simply to 
prevent the kernel from building up a huge amount of dirty pages in the 
pagecache [which, in turn, produces bursty write-out behavior].
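
To make that concrete, here is a rough sketch of chunked streaming output
using sync_file_range(). It is not MythTV code: stream_write(), the chunk
parameter and the failure counter are names invented for this example. After
each chunk it starts asynchronous writeback with SYNC_FILE_RANGE_WRITE, then
waits on the previous chunk so that roughly two chunks at most are dirty at
any time. The hints are treated as advisory, so a failing sync_file_range()
is counted rather than treated as fatal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Write len bytes from buf to fd at its current offset, in pieces of at
 * most `chunk` bytes. Returns the number of bytes written, or -1 if a
 * write fails. *hint_failures is incremented for each sync_file_range()
 * call that returns an error.
 */
ssize_t stream_write(int fd, const char *buf, size_t len, size_t chunk,
                     int *hint_failures)
{
    assert(chunk > 0 && hint_failures != NULL);

    off_t pos = lseek(fd, 0, SEEK_CUR);
    if (pos < 0)
        return -1;

    off_t prev_pos = -1;        /* start of the previous chunk, -1 if none */
    off_t prev_len = 0;
    size_t done = 0;

    while (done < len) {
        size_t want = len - done < chunk ? len - done : chunk;
        size_t got = 0;

        /* write() may be partial, so loop until the chunk is complete */
        while (got < want) {
            ssize_t n = write(fd, buf + done + got, want - got);
            if (n <= 0)
                return -1;
            got += (size_t)n;
        }

        /* kick off asynchronous writeback of the chunk just written */
        if (sync_file_range(fd, pos, (off_t)want, SYNC_FILE_RANGE_WRITE) != 0)
            (*hint_failures)++;

        /* throttle: wait until the previous chunk has reached the disk */
        if (prev_pos >= 0 &&
            sync_file_range(fd, prev_pos, prev_len,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            (*hint_failures)++;

        prev_pos = pos;
        prev_len = (off_t)want;
        pos += (off_t)want;
        done += want;
    }
    return (ssize_t)done;
}
```

Dropping the second sync_file_range() call gives the pure "start writeback,
never wait" variant. Note that sync_file_range() is Linux-specific and does
not flush file metadata, so it is not a replacement for fsync() where
durability matters.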

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 21:48                                                           ` Jeff Garzik
@ 2009-04-03 22:06                                                             ` Linus Torvalds
  2009-04-03 23:48                                                               ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-04-03 22:06 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Chris Mason, Andrew Morton, David Rees, Linux Kernel Mailing List



On Fri, 3 Apr 2009, Jeff Garzik wrote:
>
> Since I am hacking on osdblk currently, I was too slack to code up a test.
> This is what bonnie++ says, at least...

Afaik, bonnie does it all in the page cache, and only tests random reads 
(in the "random seek" test), not random writes.

> But I guess seeks are not very helpful on an SSD :)  Any pre-built random
> write tests out there?

"fio" does well:

	http://git.kernel.dk/?p=fio.git;a=summary

and I think it comes with a few example files. Here's the random write 
file that Jens suggested, and that works pretty well..

It first creates a 2GB file to do the IO on, then does random 4k writes to 
it with O_DIRECT.

If your SSD does badly at it, you'll just want to kill it, but it shows 
you how many MB/s it's doing (or, in the sucky case, how many kB/s).
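
For anyone who would rather see the workload spelled out, the following is a
small C sketch of the same idea: lay out a file, fsync it, then issue random
4k overwrites, optionally with O_DIRECT. It is only an illustration of the
access pattern, not how fio is implemented, and random_writes() and its
parameters are made up for this example. It does no timing; wrap the call
with clock_gettime() if you want numbers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

#define BLOCK 4096

/*
 * Create `path` with file_size bytes, then overwrite `count` randomly
 * chosen 4k blocks. Returns the number of random writes completed, or -1
 * if setup (allocation, open, initial layout) fails.
 */
long random_writes(const char *path, off_t file_size, long count,
                   int use_direct, unsigned seed)
{
    assert(file_size >= BLOCK && file_size % BLOCK == 0);

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)   /* O_DIRECT needs alignment */
        return -1;
    memset(buf, 0xA5, BLOCK);

    int fd = open(path, O_WRONLY | O_CREAT | (use_direct ? O_DIRECT : 0), 0600);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    long done = -1;
    off_t nblocks = file_size / BLOCK;

    /* lay the file out first so the random phase only overwrites */
    for (off_t b = 0; b < nblocks; b++)
        if (pwrite(fd, buf, BLOCK, b * BLOCK) != BLOCK)
            goto out;
    if (fsync(fd) != 0)
        goto out;

    /* random 4k overwrites; stop at the first failed write */
    for (done = 0; done < count; done++) {
        off_t b = (off_t)rand_r(&seed) % nblocks;
        if (pwrite(fd, buf, BLOCK, b * BLOCK) != BLOCK)
            break;
    }
out:
    close(fd);
    free(buf);
    return done;
}
```

Some filesystems (tmpfs on older kernels, for example) refuse O_DIRECT at
open time, in which case the function returns -1 when use_direct is set.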

		Linus

---
[global]
filename=testfile
size=2g
create_fsync=1
overwrite=1

[randwrites]
# make rw= 'randread' for random reads, 'read' for reads, etc
rw=randwrite
bs=4k
direct=1


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03  4:06                                 ` Lennart Sorensen
  2009-04-03  4:13                                   ` Linus Torvalds
@ 2009-04-03 22:28                                   ` Jeff Moyer
  2009-04-06 14:15                                     ` Lennart Sorensen
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Moyer @ 2009-04-03 22:28 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Ingo Molnar, Andrew Morton, torvalds, tytso, drees76, jesper,
	linux-kernel, jens.axboe

lsorense@csclub.uwaterloo.ca (Lennart Sorensen) writes:

> On Thu, Apr 02, 2009 at 03:00:44AM +0200, Ingo Molnar wrote:
>> I'll test this (and the other suggestions) once i'm out of the merge 
>> window.
>> 
>> I probably wont test that though ;-)
>> 
>> Going back to v2.6.14 to do pre-mutex-merge performance tests was 
>> already quite a challenge on modern hardware.
>
> Well after a day of running my mythtv box with anticipatiry rather than
> the default cfq scheduler, it certainly looks a lot better.  I haven't
> seen any slowdowns, the disk activity light isn't on solidly (it just
> flashes every couple of seconds instead), and it doesn't even mind
> me lanuching bittornado on multiple torrents at the same time as two
> recordings are taking place and some commercial flagging is taking place.
> With cfq this would usually make the system unusable (and a Q6600 with
> 6GB ram should never be unresponsive in my opinion).
>
> So so far I would rank anticipatory at about 1000x better than cfq for
> my work load.  It sure acts a lot more like it used to back in 2.6.18
> times.

Hi, Lennart,

Could you try one more test, please?  Switch back to CFQ and set
/sys/block/sdX/queue/iosched/slice_idle to 0?

I'm not sure how the applications you are running write to disk, but if
they interleave I/O between processes, this could help.  I'm not too
confident that this will make a difference, though, since CFQ changed to
time-slice based instead of quantum based before 2.6.18.  Still, it
would be another data point if you have the time.
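
If it is easier to do this from a small program than from a shell, a sketch
along these lines should work. write_tunable() is just a name picked for the
example, and sdX is a placeholder for the real device, as above. Writing to
sysfs normally requires root.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a short string to a sysfs-style tunable file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int write_tunable(const char *path, const char *value)
{
    assert(path != NULL && value != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)         /* write errors can surface at close time */
        ok = 0;
    return ok ? 0 : -1;
}
```

For the test requested here the call would be
write_tunable("/sys/block/sdX/queue/iosched/slice_idle", "0"), with the
scheduler for that device already set back to cfq.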

Thanks in advance!

-Jeff

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 21:57                                               ` Jeff Garzik
@ 2009-04-03 22:32                                                 ` Janne Grunau
  2009-04-03 22:57                                                   ` David Rees
  2009-04-03 23:29                                                   ` Jeff Garzik
  2009-04-03 22:53                                                 ` David Rees
  1 sibling, 2 replies; 664+ messages in thread
From: Janne Grunau @ 2009-04-03 22:32 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mark Lord, Lennart Sorensen, Jens Axboe, Linus Torvalds,
	Ingo Molnar, Andrew Morton, tytso, drees76, jesper,
	Linux Kernel Mailing List

On Fri, Apr 03, 2009 at 05:57:05PM -0400, Jeff Garzik wrote:
> Janne Grunau wrote:
> > On Fri, Apr 03, 2009 at 03:57:52PM -0400, Jeff Garzik wrote:
> >> mythtv/libs/libmythtv/ThreadedFileWriter.cpp is a good place to start 
> >> (Sync method... uses fdatasync if available, fsync if not).
> >>
> >> mythtv is definitely a candidate for sync_file_range() style output, IMO.
> 
> > yeah, I'm on it.
> 
> Just curious, does MythTV need fsync(), or merely to tell the kernel to 
> begin asynchronously writing data to storage?

Quoting the ThreadedFileWriter comments:

/*
 *   NOTE: This doesn't even try flush our queue of data.
 *   This only ensures that data which has already been sent
 *   to the kernel for this file is written to disk. This 
 *   means that if this backend is writing the data over a 
 *   network filesystem like NFS, then the data will be visible
 *   to the NFS server after this is called. It is also useful
 *   in preventing the kernel from buffering up so many writes
 *   that they steal the CPU for a long time when the write
 *   to disk actually occurs.
 */
 
> sync_file_range(..., SYNC_FILE_RANGE_WRITE) might be enough, if you do 
> not need to actually wait for completion.
>
> This may be the case, if the idea behind MythTV's fsync(2) is simply to 
> prevent the kernel from building up a huge amount of dirty pages in the 
> pagecache [which, in turn, produces bursty write-out behavior].

See above, we care only about the write-out. The f{data}*sync calls are
already in a separate thread doing nothing else.

Janne

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 21:57                                               ` Jeff Garzik
  2009-04-03 22:32                                                 ` Janne Grunau
@ 2009-04-03 22:53                                                 ` David Rees
  2009-04-03 23:30                                                   ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: David Rees @ 2009-04-03 22:53 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Janne Grunau, Mark Lord, Lennart Sorensen, Jens Axboe,
	Linus Torvalds, Ingo Molnar, Andrew Morton, tytso, jesper,
	Linux Kernel Mailing List

On Fri, Apr 3, 2009 at 2:57 PM, Jeff Garzik <jeff@garzik.org> wrote:
> Just curious, does MythTV need fsync(), or merely to tell the kernel to
> begin asynchronously writing data to storage?
>
> sync_file_range(..., SYNC_FILE_RANGE_WRITE) might be enough, if you do not
> need to actually wait for completion.
>
> This may be the case, if the idea behind MythTV's fsync(2) is simply to
> prevent the kernel from building up a huge amount of dirty pages in the
> pagecache [which, in turn, produces bursty write-out behavior].

The *only* reason MythTV fsyncs (or fdatasyncs) the data to disk all
the time is to keep a large amount of dirty pages from building up and
then causing horrible latencies when that data starts getting flushed
to disk.

A typical example of this would be that MythTV is recording a show in
the background while playing back another show.

When the dirty limit is hit and data gets flushed to disk, this would
keep the player's read buffer from being refilled fast enough, and then
playback would stutter.

Instead of telling people "ext3 sucks - mount it in writeback, use xfs,
or tweak your vm knobs", they simply put a hack in there which largely
eliminates the effect.

I don't think many people would care too much if they lost 30-60
seconds of their recorded TV show if the system crashes for whatever
reason.

-Dave

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 22:32                                                 ` Janne Grunau
@ 2009-04-03 22:57                                                   ` David Rees
  2009-04-03 23:29                                                   ` Jeff Garzik
  1 sibling, 0 replies; 664+ messages in thread
From: David Rees @ 2009-04-03 22:57 UTC (permalink / raw)
  To: Janne Grunau
  Cc: Jeff Garzik, Mark Lord, Lennart Sorensen, Jens Axboe,
	Linus Torvalds, Ingo Molnar, Andrew Morton, tytso, jesper,
	Linux Kernel Mailing List

On Fri, Apr 3, 2009 at 3:32 PM, Janne Grunau <j@jannau.net> wrote:
> On Fri, Apr 03, 2009 at 05:57:05PM -0400, Jeff Garzik wrote:
>> Just curious, does MythTV need fsync(), or merely to tell the kernel to
>> begin asynchronously writing data to storage?
>
> Quoting the ThreadedFileWriter comments:
>
> /*
>  *   NOTE: This doesn't even try flush our queue of data.
>  *   This only ensures that data which has already been sent
>  *   to the kernel for this file is written to disk. This
>  *   means that if this backend is writing the data over a
>  *   network filesystem like NFS, then the data will be visible
>  *   to the NFS server after this is called. It is also useful
>  *   in preventing the kernel from buffering up so many writes
>  *   that they steal the CPU for a long time when the write
>  *   to disk actually occurs.
>  */

There is no need to fsync data on an NFS mount in Linux anymore.  All
NFS mounts are mounted sync by default now unless you explicitly
specify otherwise (and then you should know what you're getting
into).

-Dave

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 22:32                                                 ` Janne Grunau
  2009-04-03 22:57                                                   ` David Rees
@ 2009-04-03 23:29                                                   ` Jeff Garzik
  2009-04-03 23:52                                                     ` Linus Torvalds
  1 sibling, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03 23:29 UTC (permalink / raw)
  Cc: Mark Lord, Lennart Sorensen, Jens Axboe, Linus Torvalds,
	Ingo Molnar, Andrew Morton, tytso, drees76, jesper,
	Linux Kernel Mailing List

Janne Grunau wrote:
> On Fri, Apr 03, 2009 at 05:57:05PM -0400, Jeff Garzik wrote:
>> Janne Grunau wrote:
>>> On Fri, Apr 03, 2009 at 03:57:52PM -0400, Jeff Garzik wrote:
>>>> mythtv/libs/libmythtv/ThreadedFileWriter.cpp is a good place to start 
>>>> (Sync method... uses fdatasync if available, fsync if not).
>>>>
>>>> mythtv is definitely a candidate for sync_file_range() style output, IMO.
>>> yeah, I'm on it.
>> Just curious, does MythTV need fsync(), or merely to tell the kernel to 
>> begin asynchronously writing data to storage?
> 
> Quoting the ThreadedFileWriter comments:
> 
> /*
>  *   NOTE: This doesn't even try flush our queue of data.
>  *   This only ensures that data which has already been sent
>  *   to the kernel for this file is written to disk. This 
>  *   means that if this backend is writing the data over a 
>  *   network filesystem like NFS, then the data will be visible
>  *   to the NFS server after this is called. It is also useful
>  *   in preventing the kernel from buffering up so many writes
>  *   that they steal the CPU for a long time when the write
>  *   to disk actually occurs.
>  */
>  
>> sync_file_range(..., SYNC_FILE_RANGE_WRITE) might be enough, if you do 
>> not need to actually wait for completion.
>>
>> This may be the case, if the idea behind MythTV's fsync(2) is simply to 
>> prevent the kernel from building up a huge amount of dirty pages in the 
>> pagecache [which, in turn, produces bursty write-out behavior].
> 
> See above, we care only about the write-out. The f{data}*sync calls are
> already in a separate thread doing nothing else.

If all you want to do is _start_ the write-out from kernel to disk, and 
let the kernel handle it asynchronously, SYNC_FILE_RANGE_WRITE will do 
that for you, eliminating the need for a separate thread.

If you need to wait for the data to hit disk, you will need the other 
SYNC_FILE_RANGE_xxx bits.
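
As a rough illustration of the "wait" variant, the sketch below writes a
buffer and then pushes only that byte range to disk, waiting for it to
complete.  write_and_wait() is a hypothetical name, not anything from
MythTV; the three flags are the ones documented in sync_file_range(2).
Note that this still does not flush file metadata.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes at `offset`, then write out
 * just that range and wait for the I/O.  Returns 0 or -1. */
int write_and_wait(int fd, const void *buf, size_t len, off_t offset)
{
    assert(fd >= 0 && buf != NULL);

    ssize_t n = pwrite(fd, buf, len, offset);
    if (n < 0 || (size_t)n != len)
        return -1;

    /* WAIT_BEFORE: wait for writeback already in flight on the range.
     * WRITE:       start write-out of the dirty pages in the range.
     * WAIT_AFTER:  wait for the write-out that was just started. */
    return sync_file_range(fd, offset, (off_t)len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Dropping the two WAIT flags turns this into the start-only form
described above.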



On a related subject, reads:  consider 
posix_fadvise(POSIX_FADV_SEQUENTIAL) and/or readahead(2) for optimizing 
the reading side of things.

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 22:53                                                 ` David Rees
@ 2009-04-03 23:30                                                   ` Jeff Garzik
  2009-04-04 16:29                                                     ` Janne Grunau
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03 23:30 UTC (permalink / raw)
  To: David Rees
  Cc: Janne Grunau, Mark Lord, Lennart Sorensen, Jens Axboe,
	Linus Torvalds, Ingo Molnar, Andrew Morton, tytso, jesper,
	Linux Kernel Mailing List

David Rees wrote:
> The *only* reason MythTV fsyncs (or fdatasyncs) the data to disk all
> the time is to keep a large amount of dirty pages from building up and
> then causing horrible latencies when that data starts getting flushed
> to disk.

sync_file_range() will definitely help that situation.

	Jeff



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 20:14                                                         ` Linus Torvalds
  2009-04-03 21:48                                                           ` Jeff Garzik
@ 2009-04-03 23:35                                                           ` Dave Jones
  1 sibling, 0 replies; 664+ messages in thread
From: Dave Jones @ 2009-04-03 23:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Chris Mason, Andrew Morton, David Rees,
	Linux Kernel Mailing List

On Fri, Apr 03, 2009 at 01:14:00PM -0700, Linus Torvalds wrote:
 > But I _think_ G.SKILL uses those horribly broken JMicron controllers. 
 > Judging by your performance numbers, it's the slightly fancier double 
 > controller version (ie basically an internal RAID0 of two identical 
 > JMicron controllers, each handling half of the flash chips).
 > 
 > Try a random write test. If it's the JMicron controllers, performance will 
 > plummet to a few tens of kilobytes per second.

I got the 64GB variant of Jeff's g-skill SSD.  When I first got it,
I ran aio-stress on it.  The numbers from the smaller blocksize tests
are pitiful.  To the extent that after running for 24hrs, I ctrl-c'd
the test.  Really, really abysmal.

	Dave


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 22:06                                                             ` Linus Torvalds
@ 2009-04-03 23:48                                                               ` Jeff Garzik
  2009-04-04 12:46                                                                 ` Mark Lord
  0 siblings, 1 reply; 664+ messages in thread
From: Jeff Garzik @ 2009-04-03 23:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Andrew Morton, Jens Axboe, Linux Kernel Mailing List

Linus Torvalds wrote:
> "fio" does well:
> 
> 	http://git.kernel.dk/?p=fio.git;a=summary

Neat tool, Jens...


> and I think it comes with a few example files. Here's the random write 
> file that Jens suggested, and that works pretty well..
> 
> It first creates a 2GB file to do the IO on, then does random 4k writes to 
> it with O_DIRECT.
> 
> If your SSD does badly at it, you'll just want to kill it, but it shows 
> you how many MB/s it's doing (or, in the sucky case, how many kB/s).


heh, so far, the SSD is poking along...

Jobs: 1 (f=1): [w] [2.5% done] [     0/   282 kb/s] [eta 02h:24m:59s]


Compared to the same job file, started at the same time, on the Seagate 
500GB SATA:

Jobs: 1 (f=1): [w] [9.9% done] [     0/  1204 kb/s] [eta 26m:28s]

Regards,

	Jeff




^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 23:29                                                   ` Jeff Garzik
@ 2009-04-03 23:52                                                     ` Linus Torvalds
  0 siblings, 0 replies; 664+ messages in thread
From: Linus Torvalds @ 2009-04-03 23:52 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mark Lord, Lennart Sorensen, Jens Axboe, Ingo Molnar,
	Andrew Morton, tytso, drees76, jesper, Linux Kernel Mailing List



On Fri, 3 Apr 2009, Jeff Garzik wrote:
> 
> If all you want to do is _start_ the write-out from kernel to disk, and let
> the kernel handle it asynchronously, SYNC_FILE_RANGE_WRITE will do that for
> you, eliminating the need for a separate thread.

It may not eliminate the need for a separate thread.

SYNC_FILE_RANGE_WRITE will still block on things. It just will block on 
_much_ less than fsync.

In particular, it will block on:

 - actually queuing up the IO (ie we need to get the bio, request etc all 
   allocated and queued up)

 - if a page is under writeback, and has been marked dirty since that 
   writeback started, we'll wait for that IO to finish in order to start a 
   new one.

and depending on load, both of these things _can_ be issues and you might 
still want to do the SYNC_FILE_RANGE_WRITE in an async thread separate 
from the main loop so that the latency of the main loop is not 
affected by that.

But the latencies will be _much_ smaller issues than with f[data]sync(), 
though, especially if you're not ever really hitting the limits on the 
disk subsystem. Because those will additionally

 - wait for all old writeback to complete (whether the page was dirtied 
   after the writeback started or not)

 - additionally, wait for all the new writeback it started.

 - wait for the metadata too (fsync()).

so they are pretty much _guaranteed_ to sleep for actual IO to complete 
(unless you didn't write anything at all to the file ;)
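
To make the difference concrete, here is a minimal sketch of a streaming
writer that kicks off write-out after every chunk instead of calling
f[data]sync().  stream_out() and its chunk parameter are invented for
the example; as described above, the sync_file_range() call can still
block for a while, so it may belong outside the latency-critical loop.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append `len` bytes to fd in pieces of at most
 * `chunk` bytes, starting write-out of each piece as soon as it has
 * been handed to the kernel.  Returns bytes written, or -1. */
ssize_t stream_out(int fd, const char *buf, size_t len, size_t chunk)
{
    assert(fd >= 0 && buf != NULL && chunk > 0);

    off_t pos = lseek(fd, 0, SEEK_END);
    if (pos < 0)
        return -1;

    size_t done = 0;
    while (done < len) {
        size_t want = (len - done < chunk) ? len - done : chunk;
        ssize_t n = write(fd, buf + done, want);
        if (n <= 0)
            return -1;

        /* Starts the I/O for this piece only.  It does not wait for
         * completion and does not touch metadata, but it can block
         * while queuing, or on a page redirtied under writeback. */
        if (sync_file_range(fd, pos, n, SYNC_FILE_RANGE_WRITE) != 0)
            return -1;

        pos += n;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

The chunk size is the knob that bounds how much dirty data can pile up
for the file between write-out requests.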

> On a related subject, reads:  consider posix_fadvise(POSIX_FADV_SEQUENTIAL)
> and/or readahead(2) for optimizing the reading side of things.

I doubt POSIX_FADV_SEQUENTIAL will do very much. The kernel tends to 
figure out the read patterns on its own pretty well. Of course, explicit 
readahead() can be noticeable for the right patterns.
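
For reference, the two calls look roughly like this in use.  This is
only a sketch: hint_sequential() and read_and_prefetch() are made-up
helper names, and the window size is something the caller would have to
tune for its own read pattern.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hint that the whole file (offset 0, length 0 = to EOF) will be read
 * sequentially.  Returns 0 or an errno value, as posix_fadvise() does. */
int hint_sequential(int fd)
{
    assert(fd >= 0);
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/* Read `len` bytes at `pos`, then ask the kernel to start pulling in
 * the next `window` bytes so a later read may find them cached. */
ssize_t read_and_prefetch(int fd, void *buf, size_t len,
                          off_t pos, size_t window)
{
    assert(fd >= 0 && buf != NULL);

    ssize_t n = pread(fd, buf, len, pos);
    if (n > 0) {
        /* Best effort only: a failed hint is not a read error. */
        (void)readahead(fd, pos + n, window);
    }
    return n;
}
```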

		Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 18:59                                     ` Jeff Garzik
@ 2009-04-04  8:18                                       ` Andrew Morton
  2009-04-04 12:40                                         ` Mark Lord
  2009-04-05  1:57                                         ` David Newall
  0 siblings, 2 replies; 664+ messages in thread
From: Andrew Morton @ 2009-04-04  8:18 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Lennart Sorensen, Mark Lord, torvalds, tytso, drees76, jesper,
	linux-kernel

On Fri, 03 Apr 2009 14:59:12 -0400 Jeff Garzik <jeff@garzik.org> wrote:

> Lennart Sorensen wrote:
> > On Fri, Apr 03, 2009 at 10:46:34AM -0400, Mark Lord wrote:
> >> My Myth box here was running 2.6.18 when originally set up,
> >> and even back then it still took *minutes* to delete large files.
> >> So that part hasn't really changed much in the interim.
> >>
> >> Because of the multi-minute deletes, the distro shutdown scripts
> >> would fail, and power off the box while it was still writing
> >> to the drives.  Ouch.
> >>
> >> That system has had XFS on it for the past year and a half now,
> >> and for Myth, there's no reason not to use XFS.  It's great!
> > 
> > Mythtv has a 'slow delete' option that I believe works by slowly
> > truncating the file.  Seems they believe that ext3 is bad at handling
> > large file deletes, so they try to spread out the pain.  I don't remember
> > if that option is on by default or not.  I turned it off.
> 
> It's pretty painful for super-large files with lots of metadata.
> 

yeah.

There's a dirty hack you can do where you append one byte to the file
every 4MB, across 1GB (say).  That will then lay the file out on-disk as

one bitmap block
one data block
one bitmap block
one data block
one bitmap block
one data block
one bitmap block
one data block
<etc>
lots-of-data-blocks

So when the time comes to delete that gigabyte, the bitmap blocks are
only one block apart, and reading them is much faster.
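
One possible reading of that trick, sketched in C: before streaming any
real data, write a single byte at every 4MB boundary across the first
1GB.  seed_layout() and its parameters are hypothetical names, and
whether the blocks really end up interleaved as listed above depends on
the ext3 allocator, which this code cannot check.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>      /* open() flags for the caller */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: touch one byte every `stride` bytes over the
 * first `total` bytes of the file, leaving holes in between.
 * Returns 0 on success, -1 on a failed write. */
int seed_layout(int fd, off_t stride, off_t total)
{
    assert(fd >= 0 && stride > 0 && total > 0);

    const char byte = 0;
    for (off_t off = 0; off < total; off += stride) {
        /* Forces one data block (plus whatever metadata block the
         * filesystem needs for it) to be allocated per stride. */
        if (pwrite(fd, &byte, 1, off) != 1)
            return -1;
    }
    return 0;
}
```

For the numbers in the text the call would be
seed_layout(fd, 4L << 20, 1L << 30), made once on the freshly created
file.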

That was one of the gruesome hacks I did way back when I was in the
streaming video recording game.

Another was the slow-delete thing.

- open the file

- unlink the file

- now sit in a loop, slowly nibbling away at the tail with
  ftruncate() until the file is gone.

The open/unlink was there so that if the system were to crash midway,
ext3 orphan recovery at reboot time would fully delete the remainder of
the file.
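
The three steps above can be sketched as follows.  This is an
illustration under assumptions, not the code from that product:
slow_delete(), the step size and the pause are all invented here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: remove `path`, releasing its blocks `step`
 * bytes at a time with a pause of `pause_ms` between truncates.
 * Returns 0 on success, -1 on error. */
int slow_delete(const char *path, off_t step, unsigned pause_ms)
{
    assert(path != NULL && step > 0);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* Unlink first: the name is gone at once, and if the box dies
     * midway the filesystem reclaims the rest as an orphan. */
    struct stat st;
    if (fstat(fd, &st) != 0 || unlink(path) != 0) {
        close(fd);
        return -1;
    }

    struct timespec delay = {
        .tv_sec  = pause_ms / 1000,
        .tv_nsec = (long)(pause_ms % 1000) * 1000000L,
    };

    /* Nibble away at the tail until nothing is left. */
    off_t size = st.st_size;
    while (size > 0) {
        size = (size > step) ? size - step : 0;
        if (ftruncate(fd, size) != 0) {
            close(fd);
            return -1;
        }
        if (size > 0 && pause_ms > 0)
            nanosleep(&delay, NULL);
    }
    return close(fd);
}
```

The step and pause together set how much delete work competes with the
recording and playback I/O at any moment.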


Another was to add an ioctl to ext3 to extend the file beyond EOF, but
only metadata - the corresponding data blocks are left uninitialised. 
That permitted large numbers of data blocks to be allocated to the file
with high contiguity, fixing the block-intermingling problems when ext3
is writing multiple files (which reservations later addressed).

This is of course insecure, but that isn't a problem on an
embedded/consumer black box device.


ext3 sucks less nowadays, but it's still a hard vacuum.

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-04  8:18                                       ` Andrew Morton
@ 2009-04-04 12:40                                         ` Mark Lord
  2009-04-05  1:57                                         ` David Newall
  1 sibling, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-04-04 12:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, Lennart Sorensen, torvalds, tytso, drees76, jesper,
	linux-kernel

Andrew Morton wrote:
> On Fri, 03 Apr 2009 14:59:12 -0400 Jeff Garzik <jeff@garzik.org> wrote:
> 
>> Lennart Sorensen wrote:
>>> On Fri, Apr 03, 2009 at 10:46:34AM -0400, Mark Lord wrote:
>>>> My Myth box here was running 2.6.18 when originally set up,
>>>> and even back then it still took *minutes* to delete large files.
>>>> So that part hasn't really changed much in the interim.
>>>>
>>>> Because of the multi-minute deletes, the distro shutdown scripts
>>>> would fail, and power off the box while it was still writing
>>>> to the drives.  Ouch.
>>>>
>>>> That system has had XFS on it for the past year and a half now,
>>>> and for Myth, there's no reason not to use XFS.  It's great!
>>> Mythtv has a 'slow delete' option that I believe works by slowly
>>> truncating the file.  Seems they believe that ext3 is bad at handling
>>> large file deletes, so they try to spread out the pain.  I don't remember
>>> if that option is on by default or not.  I turned it off.
>> It's pretty painful for super-large files with lots of metadata.
>>
> 
> yeah.
> 
> There's a dirty hack you can do where you append one byte to the file
> every 4MB, across 1GB (say).  That will then lay the file out on-disk as
> 
> one bitmap block
> one data block
> one bitmap block
> one data block
> one bitmap block
> one data block
> one bitmap block
> one data block
> <etc>
> lots-of-data-blocks
> 
> So when the time comes to delete that gigabyte, the bitmap blocks are
> only one block apart, and reading them is much faster.
> 
> That was one of the gruesome hacks I did way back when I was in the
> streaming video recording game.
> 
> Another was the slow-delete thing.
> 
> - open the file
> 
> - unlink the file
> 
> - now sit in a loop, slowly nibbling away at the tail with
>   ftruncate() until the file is gone.
> 
> The open/unlink was there so that if the system were to crash midway,
> ext3 orphan recovery at reboot time would fully delete the remainder of
> the file.
..

That's similar to what Mythtv currently does.
Except it nibbles away in painfully tiny chunks,
so deleting takes hours that way.

Which means it's still in progress when the system
auto-shutdowns between uses.  So the delete process
gets killed, and the subsequent remount,ro and umount
calls simply fail (fs is still busy), and it then
powers off while the drive light is still solidly busy.

That's where I modified the shutdown script to check
the result code, sleep, and loop again, for up to five
minutes before pulling the plug.

But switching to xfs cured all of that.  :)

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 20:05                                                       ` Jeff Garzik
  2009-04-03 20:14                                                         ` Linus Torvalds
@ 2009-04-04 12:44                                                         ` Mark Lord
  2009-04-04 21:10                                                           ` Jeff Garzik
  1 sibling, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-04-04 12:44 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Chris Mason, Andrew Morton, David Rees,
	Linux Kernel Mailing List

Jeff Garzik wrote:
>
> I've attached 'hdparm -I' in case anyone is curious.  It's from 
> newegg.com, so nothing NDA'd or sekrit.
>
> ATA device, with non-removable media
> 	Model Number:       G.SKILL 128GB SSD                       
> 	Serial Number:      MK0108480A545003B   
> 	Firmware Revision:  02.10104
> Standards:
> 	Used: ATA/ATAPI-7 T13 1532D revision 4a 
> 	Supported: 8 7 6 5 & some of 8
> Configuration:
> 	Logical		max	current
> 	cylinders	16383	16383
> 	heads		16	16
> 	sectors/track	63	63
> 	--
> 	CHS current addressable sectors:   16514064
> 	LBA    user addressable sectors:  250445824
> 	device size with M = 1024*1024:      122288 MBytes
> 	device size with M = 1000*1000:      128228 MBytes (128 GB)
..

That's odd.  I kind of expected to see the sector size,
cache size, and perhaps media rotation rate reported there..
Can you update your hdparm (sourceforge) and repost?

There might be other useful features of that drive,
which some of us are quite curious to know about!  :)

Thanks

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 23:48                                                               ` Jeff Garzik
@ 2009-04-04 12:46                                                                 ` Mark Lord
  2009-04-04 12:52                                                                   ` Huang Yuntao
  0 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-04-04 12:46 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Chris Mason, Andrew Morton, Jens Axboe,
	Linux Kernel Mailing List

Jeff Garzik wrote:
> Linus Torvalds wrote:
>> "fio" does well:
>>
>>     http://git.kernel.dk/?p=fio.git;a=summary
> 
> Neat tool, Jens...
> 
> 
>> and I think it comes with a few example files. Here's the random write 
>> file that Jens suggested, and that works pretty well..
>>
>> It first creates a 2GB file to do the IO on, then does random 4k 
>> writes to it with O_DIRECT.
>>
>> If your SSD does badly at it, you'll just want to kill it, but it 
>> shows you how many MB/s it's doing (or, in the sucky case, how many 
>> kB/s).
> 
> 
> heh, so far, the SSD is poking along...
> 
> Jobs: 1 (f=1): [w] [2.5% done] [     0/   282 kb/s] [eta 02h:24m:59s]
..

Try turning on the drive write cache?   hdparm -W1


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Linux 2.6.29
  2009-04-04 12:46                                                                 ` Mark Lord
@ 2009-04-04 12:52                                                                   ` Huang Yuntao
  0 siblings, 0 replies; 664+ messages in thread
From: Huang Yuntao @ 2009-04-04 12:52 UTC (permalink / raw)
  To: 'Linux Kernel Mailing List'

unsubscribe linux-kernel



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-03 23:30                                                   ` Jeff Garzik
@ 2009-04-04 16:29                                                     ` Janne Grunau
  2009-04-04 23:02                                                       ` Jeff Garzik
  0 siblings, 1 reply; 664+ messages in thread
From: Janne Grunau @ 2009-04-04 16:29 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David Rees, Mark Lord, Lennart Sorensen, Jens Axboe,
	Linus Torvalds, Ingo Molnar, Andrew Morton, tytso, jesper,
	Linux Kernel Mailing List

On Fri, Apr 03, 2009 at 07:30:37PM -0400, Jeff Garzik wrote:
> David Rees wrote:
> > The *only* reason MythTV fsyncs (or fdatasyncs) the data to disk all
> > the time is to keep a large amount of dirty pages from building up and
> > then causing horrible latencies when that data starts getting flushed
> > to disk.
> 
> sync_file_range() will definitely help that situation.

Jeff, could you please try the following patch for 0.21, or update to the
latest trunk revision?  I don't have a way to reproduce the high
latencies with fdatasync on ext3, data=ordered.  Doing a parallel
"dd if=/dev/zero of=file" on the same partition introduces latencies of
over 1 second even with sync_file_range.

Janne

---

Index: configure
===================================================================
--- configure	(revision 20302)
+++ configure	(working copy)
@@ -873,6 +873,7 @@
     sdl_video_size
     soundcard_h
     stdint_h
+    sync_file_range
     sys_poll_h
     sys_soundcard_h
     termios_h
@@ -2413,6 +2414,17 @@
 int main( void ) { return (round(3.999f) > 0)?0:1; }
 EOF
 
+# test for sync_file_range (linux only system call since 2.6.17)
+check_ld <<EOF && enable sync_file_range
+#define _GNU_SOURCE
+#include <fcntl.h>
+
+int main(int argc, char **argv){
+    sync_file_range(0,0,0,0);
+    return 0;
+}
+EOF
+
 # test for sizeof(int)
 for sizeof in 1 2 4 8 16; do
     check_cc <<EOF && _sizeof_int=$sizeof && break
Index: libs/libmythtv/ThreadedFileWriter.cpp
===================================================================
--- libs/libmythtv/ThreadedFileWriter.cpp	(revision 20302)
+++ libs/libmythtv/ThreadedFileWriter.cpp	(working copy)
@@ -18,6 +18,7 @@
 #include "ThreadedFileWriter.h"
 #include "mythcontext.h"
 #include "compat.h"
+#include "mythconfig.h"
 
 #if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
 #define HAVE_FDATASYNC
@@ -122,6 +123,7 @@
     // file stuff
     filename(QDeepCopy<QString>(fname)), flags(pflags),
     mode(pmode),                         fd(-1),
+    m_file_sync(0),                      m_file_wpos(0),
     // state
     no_writes(false),                    flush(false),
     in_dtor(false),                      ignore_writes(false),
@@ -154,6 +156,8 @@
         buf = new char[TFW_DEF_BUF_SIZE + 1024];
         bzero(buf, TFW_DEF_BUF_SIZE + 64);
 
+        m_file_sync =  m_file_wpos = 0;
+
         tfw_buf_size = TFW_DEF_BUF_SIZE;
         tfw_min_write_size = TFW_MIN_WRITE_SIZE;
         pthread_create(&writer, NULL, boot_writer, this);
@@ -292,7 +296,22 @@
 {
     if (fd >= 0)
     {
-#ifdef HAVE_FDATASYNC
+#ifdef HAVE_SYNC_FILE_RANGE
+        uint64_t write_position;
+
+        buflock.lock();
+        write_position = m_file_wpos;
+        buflock.unlock();
+
+        if ((write_position - m_file_sync) > TFW_MAX_WRITE_SIZE ||
+            (write_position && m_file_sync < (uint64_t)tfw_min_write_size))
+        {
+            sync_file_range(fd, m_file_sync, write_position - m_file_sync,
+                            SYNC_FILE_RANGE_WRITE);
+            m_file_sync = write_position;
+        }
+
+#elif defined(HAVE_FDATASYNC)
         fdatasync(fd);
 #else
         fsync(fd);
@@ -414,6 +433,7 @@
 
         buflock.lock();
         rpos = (rpos + size) % tfw_buf_size;
+        m_file_wpos += size;
         buflock.unlock();
 
         bufferWroteData.wakeAll();
Index: libs/libmythtv/ThreadedFileWriter.h
===================================================================
--- libs/libmythtv/ThreadedFileWriter.h	(revision 20302)
+++ libs/libmythtv/ThreadedFileWriter.h	(working copy)
@@ -40,6 +40,8 @@
     int             flags;
     mode_t          mode;
     int             fd;
+    uint64_t        m_file_sync;  ///< offset synced to disk
+    uint64_t        m_file_wpos; ///< offset written to disk
 
     // state
     bool            no_writes;

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-04 12:44                                                         ` Mark Lord
@ 2009-04-04 21:10                                                           ` Jeff Garzik
  0 siblings, 0 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-04 21:10 UTC (permalink / raw)
  To: Mark Lord
  Cc: Linus Torvalds, Chris Mason, Andrew Morton, David Rees,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1364 bytes --]

Mark Lord wrote:
> Jeff Garzik wrote:
>>
>> I've attached 'hdparm -I' in case anyone is curious.  It's from 
>> newegg.com, so nothing NDA'd or sekrit.
>>
>> ATA device, with non-removable media
>>     Model Number:       G.SKILL 128GB SSD                       
>>     Serial Number:      MK0108480A545003B
>>     Firmware Revision:  02.10104
>> Standards:
>>     Used: ATA/ATAPI-7 T13 1532D revision 4a
>>     Supported: 8 7 6 5 & some of 8
>> Configuration:
>>     Logical        max    current
>>     cylinders    16383    16383
>>     heads        16    16
>>     sectors/track    63    63
>>     --
>>     CHS current addressable sectors:   16514064
>>     LBA    user addressable sectors:  250445824
>>     device size with M = 1024*1024:      122288 MBytes
>>     device size with M = 1000*1000:      128228 MBytes (128 GB)
> ..
> 
> That's odd.  I kind of expected to see the sector size,
> cache size, and perhaps media rotation rate reported there..
> Can you update your hdparm (sourceforge) and repost?
> 
> There might be other useful features of that drive,
> which some of us are quite curious to know about!  :)

Here's output of hdparm 9.12, from Fedora rawhide.

I was unaware that both read-ahead and writeback caching were disabled 
on this drive, until that was pointed out to me in email.  huh.

I'll have to redo my tests...

	Jeff




[-- Attachment #2: ssd-hdparm.txt --]
[-- Type: text/plain, Size: 1411 bytes --]


/dev/sdb:

ATA device, with non-removable media
	Model Number:       G.SKILL 128GB SSD                       
	Serial Number:      MK0108480A545003B   
	Firmware Revision:  02.10104
Standards:
	Used: ATA/ATAPI-7 T13 1532D revision 4a 
	Supported: 8 7 6 5 & some of 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  250445824
	Logical/Physical Sector size:           512 bytes
	device size with M = 1024*1024:      122288 MBytes
	device size with M = 1000*1000:      128228 MBytes (128 GB)
	cache/buffer size  = unknown
Capabilities:
	LBA, IORDY(can be disabled)
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 1	Current = ?
	Recommended acoustic management value: 128, current value: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	   *	Power Management feature set
	   *	Write cache
	    	Look-ahead
	   *	Mandatory FLUSH_CACHE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Host-initiated interface power management
	   *	Phy event counters
Checksum: correct

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-04-04 16:29                                                     ` Janne Grunau
@ 2009-04-04 23:02                                                       ` Jeff Garzik
  2009-04-05 14:20                                                         ` Janne Grunau
  2009-04-06 14:06                                                         ` Lennart Sorensen
  0 siblings, 2 replies; 664+ messages in thread
From: Jeff Garzik @ 2009-04-04 23:02 UTC (permalink / raw)
  To: Janne Grunau
  Cc: David Rees, Mark Lord, Lennart Sorensen, Jens Axboe,
	Linus Torvalds, Ingo Molnar, Andrew Morton, tytso, jesper,
	Linux Kernel Mailing List

Janne Grunau wrote:
> On Fri, Apr 03, 2009 at 07:30:37PM -0400, Jeff Garzik wrote:
>> David Rees wrote:
>>> The *only* reason MythTV fsyncs (or fdatasyncs) the data to disk all
>>> the time is to keep a large amount of dirty pages from building up and
>>> then causing horrible latencies when that data starts getting flushed
>>> to disk.
>> sync_file_range() will definitely help that situation.
> 
> Jeff, could you please try the following patch for 0.21, or update to the
> latest trunk revision?  I don't have a way to reproduce the high
> latencies with fdatasync on ext3, data=ordered.  Doing a parallel
> "dd if=/dev/zero of=file" on the same partition introduces latencies of
> over 1 second even with sync_file_range.

Is dd + sync_file_range really a realistic comparison?  dd is streaming 
as fast as the disk can output data, whereas MythTV is streaming as fast 
as video is being recorded.  If you are maxing out your disk throughput, 
there will be obvious impact no matter what.

I would think a more accurate comparison would be recording multiple 
video streams in parallel, comparing fsync/fdatasync/sync_file_range?

IOW, what is an average MythTV setup -- what processes are actively 
reading/writing storage?  Where are you noticing latencies, and does 
sync_file_range decrease those areas of high latency?

	Jeff


* Re: Linux 2.6.29
  2009-04-04  8:18                                       ` Andrew Morton
  2009-04-04 12:40                                         ` Mark Lord
@ 2009-04-05  1:57                                         ` David Newall
  2009-04-05  3:46                                           ` Mark Lord
  1 sibling, 1 reply; 664+ messages in thread
From: David Newall @ 2009-04-05  1:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, Lennart Sorensen, Mark Lord, torvalds, tytso,
	drees76, jesper, linux-kernel

Andrew Morton wrote:
> - open the file
>
> - unlink the file
>
> - now sit in a loop, slowly nibbling away at the tail with
>   ftruncate() until the file is gone.

Why not fork and unlink in the child?


* Re: Linux 2.6.29
  2009-04-05  1:57                                         ` David Newall
@ 2009-04-05  3:46                                           ` Mark Lord
  0 siblings, 0 replies; 664+ messages in thread
From: Mark Lord @ 2009-04-05  3:46 UTC (permalink / raw)
  To: David Newall
  Cc: Andrew Morton, Jeff Garzik, Lennart Sorensen, torvalds, tytso,
	drees76, jesper, linux-kernel

David Newall wrote:
> Andrew Morton wrote:
>> - open the file
>>
>> - unlink the file
>>
>> - now sit in a loop, slowly nibbling away at the tail with
>>   ftruncate() until the file is gone.
> 
> Why not fork and unlink in the child?
..

I think it does the equivalent of that today.

Problem is, if you do the unlink without the nibbling,
then the disk locks up the system cold for 2-3 minutes
until the disk delete actually completes.
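
For illustration, a rough shell sketch of the nibbling idea. Assumptions:
GNU coreutils `stat -c` and `truncate -s` are available; the helper name
nibble_delete and the 64 MiB default step are made up for this example.
Unlike the open/unlink/ftruncate() sequence quoted above, it truncates by
path and unlinks last, because plain sh has no ftruncate() on an open
descriptor.

```shell
# nibble_delete FILE [STEP_BYTES] [PAUSE_SECONDS]
# Shrink FILE from the tail in STEP_BYTES chunks, then remove it.
# Hypothetical helper; relies on GNU `stat -c %s` and `truncate -s`.
nibble_delete() {
    file=$1
    step=${2:-67108864}     # 64 MiB per step; an arbitrary default
    pause=${3:-0}           # seconds to sleep between steps

    size=$(stat -c %s -- "$file") || return 1
    while [ "$size" -gt 0 ]; do
        if [ "$size" -gt "$step" ]; then
            size=$((size - step))
        else
            size=0
        fi
        # Each call frees at most STEP_BYTES worth of blocks.
        truncate -s "$size" -- "$file" || return 1
        [ "$pause" = 0 ] || sleep "$pause"
    done
    rm -f -- "$file"
}
```

Something like `nibble_delete /path/to/recording.mpg 67108864 1` (the path
is only an example) would then release the blocks in 64 MiB steps, one
second apart, rather than in one giant unlink.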

-ml


* Re: Linux 2.6.29
  2009-04-04 23:02                                                       ` Jeff Garzik
@ 2009-04-05 14:20                                                         ` Janne Grunau
  2009-04-06 14:06                                                         ` Lennart Sorensen
  1 sibling, 0 replies; 664+ messages in thread
From: Janne Grunau @ 2009-04-05 14:20 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David Rees, Mark Lord, Lennart Sorensen, Jens Axboe,
	Linus Torvalds, Ingo Molnar, Andrew Morton, tytso, jesper,
	Linux Kernel Mailing List

On Sat, Apr 04, 2009 at 07:02:51PM -0400, Jeff Garzik wrote:
> Janne Grunau wrote:
> > 
> > Jeff, could you please try following patch for 0.21 or update to the
> > latest trunk revision. I don't have a way to reproduce the high
> > latencies with fdatasync on ext3, data=ordered. Doing a parallel
> > "dd if=/dev/zero of=file" on the same partition introduces even with
> > sync_file_range latencies over 1 second.
> 
> Is dd + sync_file_range really a realistic comparison?  dd is streaming 
> as fast as the disk can output data, whereas MythTV is streaming as fast 
> as video is being recorded.  If you are maxing out your disk throughput, 
> there will be obvious impact no matter what.

sure, I tried simulating a case where the fsync/fdatasync from mythtv
impose high latencies on other processes due to syncing other big writes
too.

> I would think a more accurate comparison would be recording multiple 
> video streams in parallel, comparing fsync/fdatasync/sync_file_range?

I tested 3 simultaneous recordings and haven't noticed a difference. I'm
not even sure if I should. With multiple recordings at the same time mythtv
would also call fdatasync multiple times per second.

I guess I could compare how long fdatasync and sync_file_range with
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER are blocking. Not that mythtv would care
since they are called in its own thread.

> IOW, what is an average MythTV setup -- what processes are actively 
> reading/writing storage?

writing:
mythbackend - recordings and preview images (2-20mbps)

reading:
mythfrontend - viewing (2-20mbps)
mythcommflag - faster than viewing, maybe up to 50mbps (depending on cpu)

writing+reading:
mythtranscode - combined rate less than 50mbps, usually more reads than
                writes (depending on cpu)
mythtranscode (lossless) - around maximal disk throughput

> Where are you noticing latencies, and does 
> sync_file_range decrease those areas of high latency?

I don't notice latencies in mythtv, at least none for which the file systems
or the block layer can be blamed. But my setup is built to avoid
these. Mythtv records to its own disks formatted with xfs. Mythtv
generally tries to spread simultaneous recordings over different file
systems. The tests were on a different system though.

Janne


* Re: Linux 2.6.29
  2009-04-04 23:02                                                       ` Jeff Garzik
  2009-04-05 14:20                                                         ` Janne Grunau
@ 2009-04-06 14:06                                                         ` Lennart Sorensen
  1 sibling, 0 replies; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-06 14:06 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Janne Grunau, David Rees, Mark Lord, Jens Axboe, Linus Torvalds,
	Ingo Molnar, Andrew Morton, tytso, jesper,
	Linux Kernel Mailing List

On Sat, Apr 04, 2009 at 07:02:51PM -0400, Jeff Garzik wrote:
> Is dd + sync_file_range really a realistic comparison?  dd is streaming  
> as fast as the disk can output data, whereas MythTV is streaming as fast  
> as video is being recorded.  If you are maxing out your disk throughput,  
> there will be obvious impact no matter what.
>
> I would think a more accurate comparison would be recording multiple  
> video streams in parallel, comparing fsync/fdatasync/sync_file_range?
>
> IOW, what is an average MythTV setup -- what processes are actively  
> reading/writing storage?  Where are you noticing latencies, and does  
> sync_file_range decrease those areas of high latency?

I am going to give the patch a shot.  I run dual tuners after all, so
I do get multiple streams recording while doing playback at the same time.

-- 
Len Sorensen


* Re: Linux 2.6.29
  2009-04-03 22:28                                   ` Jeff Moyer
@ 2009-04-06 14:15                                     ` Lennart Sorensen
  2009-04-06 21:27                                       ` Mark Lord
  0 siblings, 1 reply; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-06 14:15 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ingo Molnar, Andrew Morton, torvalds, tytso, drees76, jesper,
	linux-kernel, jens.axboe

On Fri, Apr 03, 2009 at 06:28:36PM -0400, Jeff Moyer wrote:
> Could you try one more test, please?  Switch back to CFQ and set
> /sys/block/sdX/queue/iosched/slice_idle to 0?

I actually am running cfq at the moment, but with Mark's (I think it
was) preload library to change fsync calls to at most one per 5 seconds
instead of 10 per second.  So far that has certainly made things a lot
better as far as I can tell.  Maybe not as good as anticipatory seemed
to be but certainly better.

I can try your suggestion too.

I set sda-sdd to 0.  I removed the preload library from the mythbackend.
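
Setting the knob on several disks can be scripted roughly as below. This
is a sketch only: set_slice_idle and the SYSFS_BLOCK override are
hypothetical names for this example, and the slice_idle file is only
present while CFQ is the active scheduler for that disk.

```shell
# set_slice_idle VALUE DISK...
# Write VALUE to the CFQ slice_idle tunable of each named disk.
# SYSFS_BLOCK is a hypothetical override so the function can be tried
# against a scratch directory; it defaults to the real /sys/block.
set_slice_idle() {
    value=$1
    shift
    for disk in "$@"; do
        knob="${SYSFS_BLOCK:-/sys/block}/$disk/queue/iosched/slice_idle"
        if [ -w "$knob" ]; then
            echo "$value" > "$knob"
        else
            echo "skipping $disk: $knob not writable" >&2
        fi
    done
}
```

Run as root, `set_slice_idle 0 sda sdb sdc sdd` would cover the four disks
mentioned here; disks using another scheduler are skipped with a note on
stderr.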

> I'm not sure how the applications you are running write to disk, but if
> they interleave I/O between processes, this could help.  I'm not too
> confident that this will make a difference, though, since CFQ changed to
> time-slice based instead of quantum based before 2.6.18.  Still, it
> would be another data point if you have the time.

Well when recording two shows at once, there will be two processes
streaming to separate files, and usually there will be two commercial
flagging processes following behind reading those files and doing mysql
updates as they go.

> Thanks in advance!

No problem.  If it solves this bad behaviour, it will be all worth it.

-- 
Len Sorensen


* Re: Linux 2.6.29
  2009-04-06 14:15                                     ` Lennart Sorensen
@ 2009-04-06 21:27                                       ` Mark Lord
  2009-04-06 21:56                                         ` Lennart Sorensen
  0 siblings, 1 reply; 664+ messages in thread
From: Mark Lord @ 2009-04-06 21:27 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Jeff Moyer, Ingo Molnar, Andrew Morton, torvalds, tytso, drees76,
	jesper, linux-kernel, jens.axboe

Lennart Sorensen wrote:
> On Fri, Apr 03, 2009 at 06:28:36PM -0400, Jeff Moyer wrote:
>> Could you try one more test, please?  Switch back to CFQ and set
>> /sys/block/sdX/queue/iosched/slice_idle to 0?
> 
> I actually am running cfq at the moment, but with Mark's (I think it
> was) preload library to change fsync calls to at most one per 5 seconds
> instead of 10 per second.  So far that has certainly made things a lot
> better as far as I can tell.  Maybe not as good as anticipatory seemed
> to be but certainly better.
> 
> I can try your suggestion too.
..

Yeah, I think the sync_file_range() patch is the way to go.
It seems to be smooth enough here with four or five simultaneous recordings,
a couple of commflaggers, and an HD playback all happening at once.

Cheers


* Re: Linux 2.6.29
  2009-04-03  8:15                                       ` Ingo Molnar
@ 2009-04-06 21:46                                         ` Bill Davidsen
  0 siblings, 0 replies; 664+ messages in thread
From: Bill Davidsen @ 2009-04-06 21:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Nick Piggin, Linus Torvalds, Lennart Sorensen,
	Andrew Morton, tytso, drees76, jesper, Linux Kernel Mailing List,
	Peter Zijlstra

Ingo Molnar wrote:

> Ergo, i think pluggable designs for something as critical and as 
> central as IO scheduling has its clear downsides as it created two 
> mediocre schedulers:
> 
>  - CFQ with all the modern features but performance problems on 
>    certain workloads
> 
>  - Anticipatory with legacy features only but works (much!) better 
>    on some workloads.
> 
> ... instead of giving us just a single well-working CFQ scheduler.
> 
> This, IMHO, in its current form, seems to trump the upsides of IO 
> schedulers.
> 
> So i do think that late during development (i.e. now), _years_ down 
> the line, we should make it gradually harder for people to use AS.
> 
I rarely disagree with you, and more rarely feel like arguing a point in public, 
but you are basing your whole opinion on the premise that it is possible to have 
one io scheduler which handles all cases. And that seems obviously wrong: 
some types of activity can be addressed by tuning or adapting, but in some 
cases you need a whole different approach, and you need to lock in that approach 
even if some metric says something else would be better, where "better" is as 
seen by the developer rather than the user.


> What do you think?
> 
I think that by trying to create "one size fits all" you will hit a significant 
number of cases where it really doesn't fit well and you have so many tuning 
features both automatic and manual that you wind up with code which is big, 
inefficient, confusing to tune, hard to maintain, and generally not optimal for 
any one thing.

What we have is easy to test and the behavior is different enough in most cases 
that you can tell which is best, or at least that a change didn't help. I have 
watched long threads and chats about tuning VM (dirty_*, swappiness, etc) to be 
aware that in most cases either faster disk or more memory is the answer, not 
tuning to be "less unsatisfactory." Several distinct io schedulers is good, one 
complex bland one would not be.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



* Re: Linux 2.6.29
  2009-04-06 21:27                                       ` Mark Lord
@ 2009-04-06 21:56                                         ` Lennart Sorensen
  0 siblings, 0 replies; 664+ messages in thread
From: Lennart Sorensen @ 2009-04-06 21:56 UTC (permalink / raw)
  To: Mark Lord
  Cc: Jeff Moyer, Ingo Molnar, Andrew Morton, torvalds, tytso, drees76,
	jesper, linux-kernel, jens.axboe

On Mon, Apr 06, 2009 at 05:27:10PM -0400, Mark Lord wrote:
> Yeah, I think the sync_file_range() patch is the way to go.
> It seems to be smooth enough here with four or five simultaneous recordings,
> a couple of commflaggers, and an HD playback all happening at once.

Well would be worth a try.  So far I am not sure if the slice_idle works
or not.  I will have to try playback when I get home and see how it feels.

-- 
Len Sorensen


* Re: [PATCH 2/2] Make relatime default
  2009-03-26 22:27                                                   ` Linus Torvalds
  2009-03-27  0:15                                                     ` Frans Pop
  2009-03-27  2:05                                                     ` Frans Pop
@ 2009-04-09 20:13                                                     ` Pavel Machek
  2009-04-09 20:47                                                       ` Linus Torvalds
  2 siblings, 1 reply; 664+ messages in thread
From: Pavel Machek @ 2009-04-09 20:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Matthew Garrett, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu 2009-03-26 15:27:37, Linus Torvalds wrote:
> 
> 
> On Thu, 26 Mar 2009, Alan Cox wrote:
> > 
> > NAK this again
> 
> And I don't care.
> 
> If the distros had done this right in the year+ that this has been in, I 
> might consider your NAK to have some weight.

'No changes of ABI in stable series?' If you tweaked defaults for
ext4, before it was widely used... that would be acceptable I
guess. But breaking old setups with a kernel change is bad.

									Pavel 
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 2/2] Make relatime default
  2009-04-09 20:13                                                     ` Pavel Machek
@ 2009-04-09 20:47                                                       ` Linus Torvalds
  2009-04-09 21:15                                                         ` Pavel Machek
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-04-09 20:47 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Matthew Garrett, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath



On Thu, 9 Apr 2009, Pavel Machek wrote:
> 
> 'No changes of ABI in stable series?'

(a) we don't have a stable series any more
and
(b) this isn't an abi change, it's a system management change.

If you don't think we can make those, then I assume that you also claim 
that we can't do things like commit 1b5e62b42, which doubled the writeback 
dirty thresholds, or any of the things that changed how dirty accounting 
was done in the first place?

I would _love_ for distros to do the sane thing, but they don't. That's a 
fact.

			Linus


* Re: [PATCH 2/2] Make relatime default
  2009-04-09 20:47                                                       ` Linus Torvalds
@ 2009-04-09 21:15                                                         ` Pavel Machek
  2009-04-09 21:20                                                           ` Linus Torvalds
  0 siblings, 1 reply; 664+ messages in thread
From: Pavel Machek @ 2009-04-09 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Matthew Garrett, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu 2009-04-09 13:47:37, Linus Torvalds wrote:
> 
> 
> On Thu, 9 Apr 2009, Pavel Machek wrote:
> > 
> > 'No changes of ABI in stable series?'
> 
> (a) we don't have a stable series any more
> and
> (b) this isn't an abi change, it's a system management change.

Well, stat() syscall no longer returns sane value in st_atime, while
all the userland stayed the same; only kernel changed. I believe that
is ABI change.

> If you don't think we can make those, then I assume that you also claim 
> that we can't do things like commit 1b5e62b42, which doubled the writeback 
> dirty thresholds, or any of the things that changed how dirty accounting 
> was done in the first place?

Writeback dirty thresholds will only change timing, that was not part
of ABI. st_atime field is.

> I would _love_ for distros to do the sane thing, but they don't. That's a 
> fact.

But is this a way to do it? Are there maybe better ways? 

a) Publicly call those distros broken? 

b) Add nasty printk() to mount to force their attention?

	bb) Add nasty printk() and mdelay(1000) to really force their
		attention? :-)

c) Modify mount command to do the dirty work instead of changing
default in kernel?

	[as mountflags are not passed as a string by sys_mount(), you
	are creating a pretty nasty situation for users; users with an old
	distro but a new kernel will not even be able to get the old
	behaviour back w/o updating /sbin/mount. Doing it in mount would
	prevent that].
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 2/2] Make relatime default
  2009-04-09 21:15                                                         ` Pavel Machek
@ 2009-04-09 21:20                                                           ` Linus Torvalds
  2009-04-09 22:00                                                             ` Pavel Machek
  0 siblings, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-04-09 21:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Matthew Garrett, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath



On Thu, 9 Apr 2009, Pavel Machek wrote:
> 
> Well, stat() syscall no longer returns sane value in st_atime, while
> all the userland stayed the same; only kernel changed. I believe that
> is ABI change.

No. 

If you want the old abi, use "strictatime" in your /etc/fstab.

No ABI changed. Just the default mount options changed to be what most 
people (especially non-specialists) would likely want.
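
For reference, the relatime rule as generally understood: a read updates
atime only when the current atime is not newer than mtime or ctime, or when
it is at least a day old. A small sketch of that check in shell arithmetic
follows. The function name relatime_would_update is made up for this
example, and the 24-hour threshold is an assumption about the kernel's
behaviour rather than something stated in this thread.

```shell
# relatime_would_update ATIME MTIME CTIME NOW   (all in epoch seconds)
# Exit status 0 when a read at time NOW would update atime under the
# relatime rule sketched above, 1 otherwise. Hypothetical helper.
relatime_would_update() {
    atime=$1 mtime=$2 ctime=$3 now=$4
    # File modified or changed since it was last read: update.
    [ "$mtime" -ge "$atime" ] && return 0
    [ "$ctime" -ge "$atime" ] && return 0
    # atime is at least 24 hours old (assumed threshold): update.
    [ $((now - atime)) -ge 86400 ] && return 0
    return 1
}
```

With GNU stat this could be tried on a real file as
`relatime_would_update $(stat -c '%X %Y %Z' somefile) $(date +%s)`, where
somefile is a placeholder.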

Deal with it.

		Linus


* updated: ext3 IO latency measurements on v2.6.30-rc1
  2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
                                                   ` (5 preceding siblings ...)
  2009-03-26 18:11                                 ` Jan Kara
@ 2009-04-09 21:59                                 ` Ingo Molnar
  2009-04-10  7:34                                   ` Heinz Diehl
  2009-05-18 16:37                                   ` Sanjoy Mahajan
  6 siblings, 2 replies; 664+ messages in thread
From: Ingo Molnar @ 2009-04-09 21:59 UTC (permalink / raw)
  To: Jan Kara, Linus Torvalds, Jens Axboe, Theodore Ts'o
  Cc: Andrew Morton, Alan Cox, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath


This is an update to the ext3+CFQ latency measurements i did early 
in the merge window - originally with a v2.6.29 based kernel.

Today i've repeated my measurements under v2.6.30-rc1, using the 
exact same system and the exact same workload.

The quick executive summary:

   _what a difference a week of upstream development makes_!

:-)

Here are the specific details, as a reply to my earlier mail:

* Ingo Molnar <mingo@elte.hu> wrote:

> * Jan Kara <jack@suse.cz> wrote:
> 
> > > So tell me again how the VM can rely on the filesystem not 
> > > blocking at random points.
> >
> >   I can write a patch to make writepage() in the non-"mmapped 
> > creation" case non-blocking on journal. But I'll also have to find 
> > out whether it really helps something. But it's probably worth 
> > trying...
> 
> _all_ the problems i ever had with ext3 were 'collateral damage' 
> type of things: simple writes (sometimes even reads) getting 
> serialized on some large [but reasonable] dirtying activity 
> elsewhere - even if the system was still well within its 
> hard-dirty-limit threshold.
> 
> So it sure sounds like an area worth improving, and it's not that 
> hard to reproduce either. Take a system with enough RAM but only a 
> single disk, and do this in a kernel tree:
> 
>   sync
>   echo 3 > /proc/sys/vm/drop_caches
> 
>   while :; do
>     date
>     make mrproper      2>/dev/null >/dev/null
>     make defconfig     2>/dev/null >/dev/null
>     make -j32 bzImage  2>/dev/null >/dev/null
>   done &
> 
> Plain old kernel build, no distcc and no icecream. Wait a few 
> minutes for the system to reach equilibrium. There's no tweaking 
> anywhere, kernel, distro and filesystem defaults used everywhere:
> 
>  aldebaran:/home/mingo/linux/linux> ./compile-test 
>  Thu Mar 26 10:33:03 CET 2009
>  Thu Mar 26 10:35:24 CET 2009
>  Thu Mar 26 10:36:48 CET 2009
>  Thu Mar 26 10:38:54 CET 2009
>  Thu Mar 26 10:41:22 CET 2009
>  Thu Mar 26 10:43:41 CET 2009
>  Thu Mar 26 10:46:02 CET 2009
>  Thu Mar 26 10:48:28 CET 2009
> 
> And try to use the system while this workload is going on. Use Vim 
> to edit files in this kernel tree. Use plain _cat_ - and i hit 
> delays all the time - and it's not the CPU scheduler but all IO 
> related.

Under .30-rc1 i couldn't hit a single (!) annoying delay during half 
an hour of trying. The "Vim experience" is _totally_ smooth with a 
load average of 40+.

And this is with default, untweaked ext3 - not even ext4. I'm 
impressed.

> I have such an ext3 based system where i can do such tests and 
> where i don't mind crashes and data corruption either, so if you 
> send me experimental patches against latest -git i can try them 
> immediately. The system has 16 CPUs, 12GB of RAM and a single 
> disk.
> 
> Btw., i had this test going on that box while i wrote some simple 
> scripts in Vim - and it was a horrible experience. The worst wait 
> was well above one minute - Vim just hung there indefinitely. Not 
> even Ctrl-Z was possible. I captured one such wait, it was hanging 
> right here:
> 
>  aldebaran:~/linux/linux> cat /proc/3742/stack
>  [<ffffffff8034790a>] log_wait_commit+0xbd/0x110
>  [<ffffffff803430b2>] journal_stop+0x1df/0x20d
>  [<ffffffff8034421f>] journal_force_commit+0x28/0x2d
>  [<ffffffff80331c69>] ext3_force_commit+0x2b/0x2d
>  [<ffffffff80328b56>] ext3_write_inode+0x3e/0x44
>  [<ffffffff802ebb9d>] __sync_single_inode+0xc1/0x2ad
>  [<ffffffff802ebed6>] __writeback_single_inode+0x14d/0x15a
>  [<ffffffff802ebf0c>] sync_inode+0x29/0x34
>  [<ffffffff80327453>] ext3_sync_file+0xa7/0xb4
>  [<ffffffff802ef17d>] vfs_fsync+0x78/0xaf
>  [<ffffffff802ef1eb>] do_fsync+0x37/0x4d
>  [<ffffffff802ef228>] sys_fsync+0x10/0x14
>  [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
>  [<ffffffffffffffff>] 0xffffffffffffffff
> 
> It took about 120 seconds for it to recover.

These delays are definitely below 300 msecs now. (100 msecs is 
roughly the lag i can still notice in typing)

> And it's not just sys_fsync(). The script i wrote tests file read 
> latencies. I have created 1000 files with the same size (all copies 
> of kernel/sched.c ;-), and tested their cache-cold plain-cat 
> performance via:
> 
>   for ((i=0;i<1000;i++)); do
>     printf "file #%4d, plain reading it took: " $i
>     /usr/bin/time -f "%e seconds."  cat $i >/dev/null
>   done
> 
> I.e. plain, supposedly high-prio reads. The result is very common 
> hiccups in read latencies:
> 
> file # 579 (253560 bytes), reading it took: 0.08 seconds.
> file # 580 (253560 bytes), reading it took: 0.05 seconds.
> file # 581 (253560 bytes), reading it took: 0.01 seconds.
> file # 582 (253560 bytes), reading it took: 0.01 seconds.
> file # 583 (253560 bytes), reading it took: 4.61 seconds.
> file # 584 (253560 bytes), reading it took: 1.29 seconds.
> file # 585 (253560 bytes), reading it took: 3.01 seconds.
> file # 586 (253560 bytes), reading it took: 7.74 seconds.
> file # 587 (253560 bytes), reading it took: 3.22 seconds.
> file # 588 (253560 bytes), reading it took: 0.05 seconds.
> file # 589 (253560 bytes), reading it took: 0.36 seconds.
> file # 590 (253560 bytes), reading it took: 7.39 seconds.
> file # 591 (253560 bytes), reading it took: 7.58 seconds.
> file # 592 (253560 bytes), reading it took: 7.90 seconds.
> file # 593 (253560 bytes), reading it took: 8.78 seconds.
> file # 594 (253560 bytes), reading it took: 8.01 seconds.
> file # 595 (253560 bytes), reading it took: 7.47 seconds.
> file # 596 (253560 bytes), reading it took: 11.52 seconds.
> file # 597 (253560 bytes), reading it took: 10.33 seconds.
> file # 598 (253560 bytes), reading it took: 8.56 seconds.
> file # 599 (253560 bytes), reading it took: 7.58 seconds.

This test is totally smooth now:

file #   6 (253560 bytes), reading it took: 0.05 seconds.
file #   7 (253560 bytes), reading it took: 0.11 seconds.
file #   8 (253560 bytes), reading it took: 0.12 seconds.
file #   9 (253560 bytes), reading it took: 0.06 seconds.
file #  10 (253560 bytes), reading it took: 0.05 seconds.
file #  11 (253560 bytes), reading it took: 0.11 seconds.
file #  12 (253560 bytes), reading it took: 0.09 seconds.
file #  13 (253560 bytes), reading it took: 0.09 seconds.
file #  14 (253560 bytes), reading it took: 0.03 seconds.
file #  15 (253560 bytes), reading it took: 0.08 seconds.
file #  16 (253560 bytes), reading it took: 0.15 seconds.
file #  17 (253560 bytes), reading it took: 0.06 seconds.
file #  18 (253560 bytes), reading it took: 0.13 seconds.
file #  19 (253560 bytes), reading it took: 0.16 seconds.
file #  20 (253560 bytes), reading it took: 0.29 seconds.
file #  21 (253560 bytes), reading it took: 0.18 seconds.
file #  22 (253560 bytes), reading it took: 0.28 seconds.
file #  23 (253560 bytes), reading it took: 0.04 seconds.

290 msecs was the worst in this series above.
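
Picking the worst value out of such a log by eye gets tedious, so here is
a small sketch that does it with awk. The function name worst_latency is
invented for this example; it only assumes the "took: N seconds." wording
shown in the output above.

```shell
# worst_latency: read lines such as
#   file #  20 (253560 bytes), reading it took: 0.29 seconds.
# on stdin and print the largest "took:" value seen, in seconds.
worst_latency() {
    awk '{
        # The number is the field right after the literal "took:".
        for (i = 1; i < NF; i++)
            if ($i == "took:" && $(i + 1) + 0 > max)
                max = $(i + 1) + 0
    } END { printf "%.2f\n", max }'
}
```

Piping the test output through it, e.g. `./read-test | worst_latency`
(read-test being a placeholder name for the script), prints a single
number.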

The vim read+write test takes longer:

aldebaran:~/linux/linux/test-files/src> ./vim-test
file #   0 (253560 bytes), Vim-opening it took: 2.35 seconds.
file #   1 (253560 bytes), Vim-opening it took: 2.09 seconds.
file #   2 (253560 bytes), Vim-opening it took: 2.20 seconds.
file #   3 (253560 bytes), Vim-opening it took: 2.14 seconds.
file #   4 (253560 bytes), Vim-opening it took: 2.15 seconds.
file #   5 (253560 bytes), Vim-opening it took: 2.13 seconds.
file #   6 (253560 bytes), Vim-opening it took: 2.11 seconds.
file #   7 (253560 bytes), Vim-opening it took: 2.09 seconds.
file #   8 (253560 bytes), Vim-opening it took: 2.03 seconds.
file #   9 (253560 bytes), Vim-opening it took: 2.03 seconds.
file #  10 (253560 bytes), Vim-opening it took: 2.06 seconds.
file #  11 (253560 bytes), Vim-opening it took: 2.19 seconds.
file #  12 (253560 bytes), Vim-opening it took: 2.07 seconds.
file #  13 (253560 bytes), Vim-opening it took: 2.02 seconds.

I suspect that is to be expected? The test does:

   vim -c "q" $i 2>/dev/null >/dev/null

2 seconds is OK-ish - when close+writing a file i mentally expect 
some short delay anyway. I think vim fsyncs the swap file in that 
case as well.

I haven't actually experienced any such delays while editing files.

> The system's RAM is ridiculously under-utilized, 96.1% is free, only 
> 3.9% is utilized:
> 
>               total       used       free     shared    buffers     cached
>  Mem:      12318192     476732   11841460          0      48324     142936
>  -/+ buffers/cache:     285472   12032720
>  Swap:      4096564          0    4096564
> 
> Dirty data in /proc/meminfo fluctuates between 0.4% and 1.6% of 
> total RAM. (the script removes the freshly build kernel object 
> files, so the workload is pretty steady.)
> 
> The peak of 1.6% looks like this:
> 
> Dirty:            118376 kB
> Dirty:            143784 kB
> Dirty:            161756 kB
> Dirty:            185084 kB
> Dirty:            210524 kB
> Dirty:            213348 kB
> Dirty:            200124 kB
> Dirty:            122152 kB
> Dirty:            121508 kB
> Dirty:            121512 kB
> 
> (1 second snapshots)
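
Dirty samples like the ones quoted above can be collected with something
along these lines. A sketch only: dirty_kb and dirty_percent are made-up
helper names, and the optional file argument exists purely so they can be
pointed at a saved copy of /proc/meminfo.

```shell
# dirty_kb [FILE]: print the Dirty: value (in kB) from a meminfo-style file.
dirty_kb() {
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}

# dirty_percent [FILE]: print Dirty as a percentage of MemTotal,
# rounded to one decimal place.
dirty_percent() {
    awk '$1 == "MemTotal:" { total = $2 }
         $1 == "Dirty:"    { dirty = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * dirty / total }' \
        "${1:-/proc/meminfo}"
}
```

A loop such as `while sleep 1; do dirty_kb; done` then gives one sample per
second.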
> 
> So the problems are all around the place and they are absolutely, 
> trivially reproducible. And this is how a default ext3 based distro 
> and the default upstream kernel will present itself to new Linux 
> users and developers. It's not a pretty experience.
> 
> Oh, and while at it - also a job control complaint. I tried to 
> Ctrl-C the above script:
> 
> file # 858 (253560 bytes), reading it took: 0.06 seconds.
> file # 859 (253560 bytes), reading it took: 0.02 seconds.
> file # 860 (253560 bytes), reading it took: 5.53 seconds.
> file # 861 (253560 bytes), reading it took: 3.70 seconds.
> file # 862 (253560 bytes), reading it took: 0.88 seconds.
> file # 863 (253560 bytes), reading it took: 0.04 seconds.
> file # 864 (253560 bytes), reading it took: ^C0.69 seconds.
> file # 865 (253560 bytes), reading it took: ^C0.49 seconds.
> file # 866 (253560 bytes), reading it took: ^C0.01 seconds.
> file # 867 (253560 bytes), reading it took: ^C0.02 seconds.
> file # 868 (253560 bytes), reading it took: ^C^C0.01 seconds.
> file # 869 (253560 bytes), reading it took: ^C^C0.04 seconds.
> file # 870 (253560 bytes), reading it took: ^C^C^C0.03 seconds.
> file # 871 (253560 bytes), reading it took: ^C0.02 seconds.
> file # 872 (253560 bytes), reading it took: ^C^C0.02 seconds.
> file # 873 (253560 bytes), reading it took: 
> ^C^C^C^Caldebaran:~/linux/linux/test-files/src> 
> 
> I had to hit Ctrl-C numerous times before Bash would honor it. 
> This too is a very common thing on large SMP systems. I'm willing 
> to test patches until all these problems are fixed. Any takers?

This Bash bug still occurs and is highly annoying when using scripts 
on SMP Linux boxes and trying to Ctrl-C out of them.

But all in all, the ext3 and IO latency problems seem to be 
thoroughly cured!

To me the past 1 week has made more difference in filesystems and IO 
interactivity than all filesystems development done in the past 5+ 
years, combined. Kudos!

	Ingo


* Re: [PATCH 2/2] Make relatime default
  2009-04-09 21:20                                                           ` Linus Torvalds
@ 2009-04-09 22:00                                                             ` Pavel Machek
  0 siblings, 0 replies; 664+ messages in thread
From: Pavel Machek @ 2009-04-09 22:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Matthew Garrett, Theodore Tso, Ingo Molnar, Jan Kara,
	Andrew Morton, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On Thu 2009-04-09 14:20:40, Linus Torvalds wrote:
> 
> 
> On Thu, 9 Apr 2009, Pavel Machek wrote:
> > 
> > Well, stat() syscall no longer returns sane value in st_atime, while
> > all the userland stayed the same; only kernel changed. I believe that
> > is ABI change.
> 
> No. 
> 
> If you want the old abi, use "strictatime" in your /etc/fstab.
  ~~~~~~~~~~~~~~~~~~~~~~~~

So you agree it is an ABI change :-).

> No ABI changed. Just the default mount options changed to be what most 
> people (especially non-specialists) would likely want.

Maybe.

But let's see what awaits me with the 2.6.30-rc1 update:

root@amd:~# mount /data -oremount,noatime
root@amd:~# mount /data -oremount,strictatime
mount: /data not mounted already, or bad option

...oh no, my mount is too old, even though it seems to be up to date
with Debian testing. Should I have to install mount from source just to
keep compatible system settings?

There has to be a better way.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
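
For readers following the st_atime argument above, here is a minimal
sketch of the decision rule that relatime is generally documented as
applying from 2.6.30 on: update the access time only when it is not newer
than the last modification or status change, or when it is at least a day
old. The function name, the plain-seconds timestamps and the DAY_SECONDS
constant are assumptions of this sketch; the kernel's own logic is in C
and is not quoted in this thread.

```python
DAY_SECONDS = 24 * 60 * 60  # assumed cutoff: one day, expressed in seconds


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read should update atime under relatime-style rules.

    All arguments are timestamps in seconds. This is an illustrative model,
    not the kernel's implementation.
    """
    if mtime >= atime:
        # File was modified at or after the last recorded access.
        return True
    if ctime >= atime:
        # Inode metadata changed at or after the last recorded access.
        return True
    if now - atime >= DAY_SECONDS:
        # Recorded access time is at least a day stale.
        return True
    return False
```

Under this model a file that is read repeatedly without being modified
gets at most about one atime write per day, which is why stat() can report
an st_atime older than the true last access.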

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: updated: ext3 IO latency measurements on v2.6.30-rc1
  2009-04-09 21:59                                 ` updated: ext3 IO latency measurements on v2.6.30-rc1 Ingo Molnar
@ 2009-04-10  7:34                                   ` Heinz Diehl
  2009-05-18 16:37                                   ` Sanjoy Mahajan
  1 sibling, 0 replies; 664+ messages in thread
From: Heinz Diehl @ 2009-04-10  7:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Jens Axboe, Theodore Ts'o,
	Andrew Morton, Alan Cox, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

On 10.04.2009, Ingo Molnar wrote: 

> This is an update to the ext3+CFQ latency measurements i did early 
> in the merge window - originally with a v2.6.29 based kernel.
> 
> Today i've repeated my measurements under v2.6.30-rc1, using the 
> exact same system and the exact same workload.
> 
> The quick executive summary:
> 
>    _what a difference a week of upstream development makes_!

Here's a quick test with xfs+CFQ on one of my testing machines, 
using fsync-tester while running Linus' "bigfile torture test". 
It underlines your results, showing significant improvement!

The xfs partition is mounted with the defaults, and I'm expecting 
slightly more improvement after mounting with noatime and nobarrier.

2.6.30-rc1

fsync time: 0.4674
fsync time: 1.0473
fsync time: 0.4190
fsync time: 1.0800
fsync time: 1.0132
fsync time: 1.0193
fsync time: 1.0191
fsync time: 1.1318
fsync time: 0.9924
fsync time: 1.0568
fsync time: 1.0676
fsync time: 1.0241
fsync time: 1.0530
fsync time: 0.9709
fsync time: 0.4475
fsync time: 0.6320
fsync time: 1.0906
fsync time: 0.6344
fsync time: 1.0632
fsync time: 1.0455
fsync time: 1.0530
fsync time: 1.0655
fsync time: 1.0032
fsync time: 1.0644
fsync time: 1.1573
fsync time: 1.0197
fsync time: 1.0342
fsync time: 1.0643
fsync time: 0.0342
fsync time: 0.7603
fsync time: 1.0905
fsync time: 0.6340

2.6.29.1

fsync time: 2.1255
fsync time: 2.2851
fsync time: 1.9048
fsync time: 1.0999
fsync time: 2.0117
fsync time: 2.0819
fsync time: 2.0819
fsync time: 0.0225
fsync time: 0.2796
fsync time: 0.3879
fsync time: 0.6584
fsync time: 0.9287
fsync time: 0.2488
fsync time: 2.0994
fsync time: 2.0161
fsync time: 1.9736
fsync time: 2.0231
fsync time: 2.2888
fsync time: 2.1719
fsync time: 1.8452
fsync time: 0.3278
fsync time: 1.0881
fsync time: 0.5202
fsync time: 1.3339
fsync time: 0.4295
fsync time: 1.2772
fsync time: 1.9436
fsync time: 2.1048
fsync time: 1.9376
fsync time: 2.0786
fsync time: 1.9202
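
The two runs above are easier to compare as summary numbers than as raw
columns. A small sketch of how that could be done, assuming the output has
been saved as lines of the form "fsync time: <seconds>" (the function name
is hypothetical, and fsync-tester itself is not being modified here):

```python
import statistics


def summarize_fsync_log(lines):
    """Return (count, mean, worst) for lines shaped like 'fsync time: 1.0473'."""
    times = []
    for line in lines:
        label, sep, value = line.partition(":")
        # Ignore anything that is not an fsync timing line (headers, blanks).
        if sep and label.strip() == "fsync time":
            times.append(float(value))
    if not times:
        raise ValueError("no 'fsync time:' lines found")
    return len(times), statistics.mean(times), max(times)
```

Feeding each kernel's block to summarize_fsync_log() separately gives a
sample count, an average and a worst case per kernel, which is usually
enough to see whether the latency difference is consistent or driven by a
few outliers.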


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: updated: ext3 IO latency measurements on v2.6.30-rc1
  2009-04-09 21:59                                 ` updated: ext3 IO latency measurements on v2.6.30-rc1 Ingo Molnar
  2009-04-10  7:34                                   ` Heinz Diehl
@ 2009-05-18 16:37                                   ` Sanjoy Mahajan
  1 sibling, 0 replies; 664+ messages in thread
From: Sanjoy Mahajan @ 2009-05-18 16:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Linus Torvalds, Jens Axboe, Theodore Ts'o,
	Andrew Morton, Alan Cox, Arjan van de Ven, Peter Zijlstra,
	Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List,
	Oleg Nesterov, Roland McGrath

To test the ext3 improvements in 2.6.30-rc5, I tried my previous test of
starting an rxvt while findutils or 'aptitude dist-upgrade' is madly
spinning the disk.

With vanilla 2.6.29 in those circumstances, starting an rxvt took 4
seconds on an otherwise unloaded system (Thinkpad T60 w/ 1.83GHz
dual-core CPU, 1.5GB RAM, 5400 rpm drive).  The second rxvt came up
right away.  So the first rxvt probably stalled until /usr/bin/rxvt
could be grabbed from disk.

With 2.6.30-rc5, the rxvt showed up much faster -- either right away
(maybe it was still in the cache) or after about 0.5 seconds (probably
the uncached case).

Other tests were also snappy, like opening a 10MB PDF file in Emacs,
adding a newline at the beginning, and saving it (while findutils was
running).  I hadn't done the same test with 2.6.29, but I'm pretty sure
it would have been painfully slow if findutils was running.

So, relative to 2.6.29, I see a large improvement in interactive
behavior under high IO load.

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
 glorify the hunters.'  --African Proverb

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29 14:45                                                         ` Pavel Machek
  2009-03-29 15:47                                                           ` Linus Torvalds
@ 2009-03-30 14:22                                                           ` Morten P.D. Stevens
  1 sibling, 0 replies; 664+ messages in thread
From: Morten P.D. Stevens @ 2009-03-30 14:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Bodo Eggert, Linus Torvalds, Matthew Garrett, Alan Cox,
	Theodore Tso, Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

2009/3/29 Pavel Machek <pavel@ucw.cz>:

> Actually ext2 is more reliable than ext3 --

ext2 more reliable than ext3? Is this a joke?
ext2 is about 15 years old and the worst Linux file system ever. It's
slow, no journaling...

ext3 is fast, rock-stable and solid.

I think ext4 is more reliable than ext3.

-

Morten

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29 15:47                                                           ` Linus Torvalds
@ 2009-03-29 19:15                                                             ` Pavel Machek
  0 siblings, 0 replies; 664+ messages in thread
From: Pavel Machek @ 2009-03-29 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bodo Eggert, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Hi!

> > Actually ext2 is more reliable than ext3 --  fsck tells you
> > about errors on parts of disk that are not normally used.
> 
> No. ext2 is not more reliable than ext3.
> 
> ext2 gets way more errors (that whole 5s + 30s thing), and has no 
> "data=ordered" mode to even ask for more reliable behavior.
> 
> And even if compared to "data=writeback" (which approximates the ext2 
> writeout ordering), and assuming that the errors are comparable, at least 
> ext3 ends up automatically fixing up a lot of the errors that cause 
> inabilities to boot etc.
> 
> So don't be silly. ext3 is way more reliable than ext2. In fact, ext3 with 
> "data=ordered" is rather hard to screw up (but not impossible), and
> the 

Well, ext3 is pretty good, and if you have reliable hardware & kernel,
so all your unclean reboots are due to power failures, it is better.

If you have a flaky IDE cable, bad disk driver, non-Intel flash storage
or memory with bit flips, you are better off with ext2 -- it catches
problems faster. Periodic disk checks make ext3 pretty good;
unfortunately at least one distro silently disables them.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-29 14:45                                                         ` Pavel Machek
@ 2009-03-29 15:47                                                           ` Linus Torvalds
  2009-03-29 19:15                                                             ` Pavel Machek
  2009-03-30 14:22                                                           ` Morten P.D. Stevens
  1 sibling, 1 reply; 664+ messages in thread
From: Linus Torvalds @ 2009-03-29 15:47 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Bodo Eggert, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List



On Sun, 29 Mar 2009, Pavel Machek wrote:
> 
> Actually ext2 is more reliable than ext3 --  fsck tells you
> about errors on parts of disk that are not normally used.

No. ext2 is not more reliable than ext3.

ext2 gets way more errors (that whole 5s + 30s thing), and has no 
"data=ordered" mode to even ask for more reliable behavior.

And even if compared to "data=writeback" (which approximates the ext2 
writeout ordering), and assuming that the errors are comparable, at least 
ext3 ends up automatically fixing up a lot of the errors that cause 
inabilities to boot etc.

So don't be silly. ext3 is way more reliable than ext2. In fact, ext3 with 
"data=ordered" is rather hard to screw up (but not impossible), and the 
only real complaint in this thread is just the fsync performance issue, 
not the reliability.

So don't go overboard. Ext3 works perfectly well, and has just that one 
(admittedly fairly annoying) major issue - and one that wasn't really 
historically even a big deal. I mean, nobody really did fsync() all that 
much, and traditionally people cared more about throughput than latency 
(or at least that was what all the benchmarks are about, which sadly seems 
to still continue).

I do agree that "data=writeback" is broken, but ext2 was equally broken. 

			Linus

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-28 11:53                                                       ` Bodo Eggert
@ 2009-03-29 14:45                                                         ` Pavel Machek
  2009-03-29 15:47                                                           ` Linus Torvalds
  2009-03-30 14:22                                                           ` Morten P.D. Stevens
  0 siblings, 2 replies; 664+ messages in thread
From: Pavel Machek @ 2009-03-29 14:45 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

On Sat 2009-03-28 12:53:34, Bodo Eggert wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Fri, 27 Mar 2009, Linus Torvalds wrote:
> 
> >> Yes, some editors (vi, emacs) do it, but even there it's configurable.
> > 
> > .. and looking at history, it's even pretty modern. From the vim logs:
> > 
> > Patch 6.2.499
> > Problem:   When writing a file and halting the system, the file might be lost
> > when using a journalling file system.
> > Solution:  Use fsync() to flush the file data to disk after writing a file.
> > (Radim Kolar)
> > Files:     src/fileio.c
> > 
> > so it looks (assuming those patch numbers mean what they would seem to
> > mean) that 'fsync()' in vim is from after 6.2 was released. Some time in
> > 2004.
> 
> Besides that, it's a fix specific for /journaled/ filesystems. It's easy to see
> that the same journal that was supposed to increase filesystem reliability
> is CAUSING more unreliable behavior.

Journaling is _not_ supposed to increase filesystem reliability.

It improves fsck time. That's it.

Actually ext2 is more reliable than ext3 --  fsck tells you
about errors on parts of disk that are not normally used.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 21:53                               ` Bodo Eggert
  2009-03-28  6:51                                 ` Mike Galbraith
@ 2009-03-28 12:12                                 ` Theodore Tso
  1 sibling, 0 replies; 664+ messages in thread
From: Theodore Tso @ 2009-03-28 12:12 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Matthew Garrett, Linus Torvalds, Andrew Morton, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

On Fri, Mar 27, 2009 at 10:53:26PM +0100, Bodo Eggert wrote:
> data=ordered is a sane way of handling data. Otherwise, the millions
> would change their ext3 to data=writeback.

See the discussion about defaulting to "relatime" (or my preferred,
"noatime") mount option.  It's a very sane thing to do, yet most
people don't use anything other than the defaults.  And now we're told
even most distros are hesitant to tweak tuning parameters away from
the default.

					- Ted

^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
       [not found]                                                     ` <ckoYP-2DC-13@gated-at.bofh.it>
@ 2009-03-28 11:53                                                       ` Bodo Eggert
  2009-03-29 14:45                                                         ` Pavel Machek
  0 siblings, 1 reply; 664+ messages in thread
From: Bodo Eggert @ 2009-03-28 11:53 UTC (permalink / raw)
  To: Linus Torvalds, Matthew Garrett, Alan Cox, Theodore Tso,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, 27 Mar 2009, Linus Torvalds wrote:

>> Yes, some editors (vi, emacs) do it, but even there it's configurable.
> 
> .. and looking at history, it's even pretty modern. From the vim logs:
> 
> Patch 6.2.499
> Problem:   When writing a file and halting the system, the file might be lost
> when using a journalling file system.
> Solution:  Use fsync() to flush the file data to disk after writing a file.
> (Radim Kolar)
> Files:     src/fileio.c
> 
> so it looks (assuming those patch numbers mean what they would seem to
> mean) that 'fsync()' in vim is from after 6.2 was released. Some time in
> 2004.

Besides that, it's a fix specific for /journaled/ filesystems. It's easy to see
that the same journal that was supposed to increase filesystem reliability
is CAUSING more unreliable behavior.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
  2009-03-27 21:53                               ` Bodo Eggert
@ 2009-03-28  6:51                                 ` Mike Galbraith
  2009-03-28 12:12                                 ` Theodore Tso
  1 sibling, 0 replies; 664+ messages in thread
From: Mike Galbraith @ 2009-03-28  6:51 UTC (permalink / raw)
  To: 7eggert
  Cc: Theodore Tso, Matthew Garrett, Linus Torvalds, Andrew Morton,
	David Rees, Jesper Krogh, Linux Kernel Mailing List

On Fri, 2009-03-27 at 22:53 +0100, Bodo Eggert wrote:

> data=ordered is a sane way of handling data. Otherwise, the millions
> would change their ext3 to data=writeback.

This one of the millions did that quite a while ago.  Sanity be damned,
my quality of life improved by doing so.

	-Mike


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
       [not found]                         ` <ck93x-2oJ-3@gated-at.bofh.it>
@ 2009-03-27 23:22                           ` Bodo Eggert
  0 siblings, 0 replies; 664+ messages in thread
From: Bodo Eggert @ 2009-03-27 23:22 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Theodore Tso, David Rees,
	Jesper Krogh, Linux Kernel Mailing List

Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 26 Mar 2009 18:03:15 -0700 (PDT) Linus Torvalds
>> On Thu, 26 Mar 2009, Andrew Morton wrote:

>> > userspace can get closer than the kernel can.
>> 
>> Andrew, that's SIMPLY NOT TRUE.
>> 
>> You state that without any amount of data to back it up, as if it was some
>> kind of truism. It's not.
> 
> I've seen you repeatedly fiddle the in-kernel defaults based on
> in-field experience.  That could just as easily have been done in
> initscripts by distros, and much more effectively because it doesn't
> need a new kernel.  That's data.
> 
> The fact that this hasn't even been _attempted_ (afaik) is deplorable.
> 
> Why does everyone just sit around waiting for the kernel to put a new
> value into two magic numbers which userspace scripts could have set?

Because the user controlling userspace does not understand your knobs.
I want to say "file cache minimum 128 MB (otherwise my system crawls),
max 1.5 GB (or the same happens due to swapping)". Maybe I'll want to say
"start writing if you have data for one second of max. transfer rate"
(obviously a per-device setting).

> My /etc/rc.local has been tweaking dirty_ratio, dirty_background_ratio
> and swappiness for many years.  I guess I'm just incredibly advanced.

These settings are good, but they can't prevent the filecache from
growing until the mouse driver gets swapped out.
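
For anyone who wants to see what their own system is currently using
before arguing about defaults, here is a minimal sketch that reads the
three knobs named above. It assumes they are exposed as small integer
files under /proc/sys/vm, as on a typical Linux system; the function name
and the "return None when missing" behavior are choices of this sketch.

```python
import os

# The three tunables mentioned in this subthread.
VM_KNOBS = ("dirty_ratio", "dirty_background_ratio", "swappiness")


def read_vm_knobs(base="/proc/sys/vm", names=VM_KNOBS):
    """Return {knob_name: int value, or None if the file is missing}."""
    values = {}
    for name in names:
        path = os.path.join(base, name)
        try:
            with open(path) as knob:
                values[name] = int(knob.read().strip())
        except FileNotFoundError:
            # Knob not present on this system (or base is not a real procfs).
            values[name] = None
    return values
```

Calling read_vm_knobs() with no arguments reports the live values; passing
a different base directory makes the function usable against a saved copy
of the files. It only reads, so it does not need root.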

You happened to find good settings for your setup. Maybe I did once, too,
but it stopped working for the pathological cases in which I'd need
tweaking (which includes normal operation on my laptop), and having a
numeric swappiness without units or a guide did not help. And instead of
dedicating a week to loading NASA images in GIMP (which was a pathological
case on my old desktop) in order to find acceptable settings, I just didn't
do that then.



^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
       [not found]                             ` <ckcE8-845-9@gated-at.bofh.it>
@ 2009-03-27 21:53                               ` Bodo Eggert
  2009-03-28  6:51                                 ` Mike Galbraith
  2009-03-28 12:12                                 ` Theodore Tso
       [not found]                               ` <ckdgW-uA-1@gated-at.bofh.it>
  1 sibling, 2 replies; 664+ messages in thread
From: Bodo Eggert @ 2009-03-27 21:53 UTC (permalink / raw)
  To: Theodore Tso, Theodore Tso, Matthew Garrett, Linus Torvalds,
	Andrew Morton, David Rees, Jesper Krogh,
	Linux Kernel Mailing List

Theodore Tso <tytso@mit.edu> wrote:
> On Fri, Mar 27, 2009 at 03:47:05AM +0000, Matthew Garrett wrote:

>> Oh, for the love of a whole range of mythological figures. ext3 didn't
>> train application programmers that they could be careless about fsync().
>> It gave them functionality that they wanted, ie the ability to do things
>> like rename a file over another one with the expectation that these
>> operations would actually occur in the same order that they were
>> generated. More to the point, it let them do this *without* having to
>> call fsync(), resulting in a significant improvement in filesystem
>> usability.

> There were plenty of applications that were written for Unix *and*
> Linux systems before ext3 existed, and they worked just fine.  Back
> then, people were drilled into the fact that they needed to use
> fsync(), and fsync() wasn't expensive, so there wasn't a big deal in
> terms of usability.  The fact that fsync() was expensive was precisely
> because of ext3's data=ordered problem.  Writing files safely meant
> that you had to check error returns from fsync() *and* close().

> In fact, if you care about making sure that data doesn't get lost due
> to disk errors, you *must* call fsync().

People don't care about data getting lost if hell breaks loose, but they
care if the old data is guaranteed to be destroyed immediately while the
new data is written out only after a delay.

Or more simply: Old state: good. New state: good. In-between state: bad.
And journaling with delayed data is exposing the in-between state for
a long period.
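
The "old state or new state, never in between" requirement is what the
usual write-to-temporary, fsync(), rename() sequence is meant to provide
when the application does the work itself. A minimal sketch of that
sequence, assuming a POSIX system where a directory can be opened and
fsync()ed; the function name and the ".tmp-" prefix are hypothetical, and
this is not taken from any application discussed in the thread:

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data (bytes), keeping old or new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # so that the final rename is a single atomic operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # raises OSError if the write-out fails
        # Leaving the with-block closes the file; a close() error raises too.
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory entry so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Both the fsync() and the close() failures surface as exceptions here,
which corresponds to the "check error returns from fsync() *and* close()"
point quoted above. The cost is exactly the fsync() latency this thread is
arguing about.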

> Pavel may have complained 
> that fsync() can sometimes drop errors if some other process also has
> the file open and calls fsync() --- but if you don't, and you rely on
> ext3 to magically write the data blocks out as a side effect of the
> commit in data=ordered mode, there's no way to signal the write error
> to the application, and you are *guaranteed* to lose the I/O error
> indication.

Fortunately, IO errors are not common, and errors=remount-ro will
prevent it from being fatal.

> I can tell you quite authoritatively that we didn't implement
> data=ordered to make life easier for application writers, and
> application writers didn't come to ext3 developers asking for this
> convenience.  It may have **accidentally** given them convenience that
> they wanted, but it also made fsync() slow.

data=ordered is a sane way of handling data. Otherwise, the millions
would change their ext3 to data=writeback.

>> I'm utterly and screamingly bored of this "Blame userspace" attitude.
> 
> I'm not blaming userspace.  I'm blaming ourselves, for implementing an
> attractive nuisance, and not realizing that we had implemented an
> attractive nuisance; which years later, is also responsible for these
> latency problems, both with and without fsync() ---- *and* which have
> also trained people into believing that fsync() is always expensive,
> and must be avoided at all costs --- which had not previously been
> true!

I've been waiting ages for a sync() to complete long before reiserfs was
out to make ext2 jealous. Besides that, I don't need the data to be on disk,
I need the update to be mostly-atomic, leaving only small gaps to destroy my
data. Pure chance can (and usually will) give me a better guarantee than what
ext4 did.

I don't know about the logic you put into ext4 to work around the issue, but
I can imagine marking empty-file inodes (and O_APPEND or any i~?) as poisoned
if delayed blocks are appended, and if these poisoned inodes (and depending
operations) don't get played back, it might work acceptably.


^ permalink raw reply	[flat|nested] 664+ messages in thread

* Re: Linux 2.6.29
       [not found]             ` <cjeS0-5nC-33@gated-at.bofh.it>
@ 2009-03-25 15:19               ` Bodo Eggert
  0 siblings, 0 replies; 664+ messages in thread
From: Bodo Eggert @ 2009-03-25 15:19 UTC (permalink / raw)
  To: Theodore Tso, Theodore Tso, Ingo Molnar, Alan Cox,
	Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
	Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds,
	Linux Kernel Mailing List

Theodore Tso <tytso@mit.edu> wrote:

> OK, so there are a couple of solutions to this problem.  One is to use
> ext4 and delayed allocation.  This solves the problem by simply not
> allocating the blocks in the first place, so we don't have to force
> them out to solve the security problem that data=ordered was trying to
> solve.  Simply mounting an ext3 filesystem using ext4, without making
> any change to the filesystem format, should solve the problem.

[...]

> However, these days, nearly all Linux boxes are single user machines,
> so the security concern is much less of a problem.  So maybe the best
> solution for now is to make data=writeback the default.  This solves
> the problem too.  The only problem with this is that there are a lot
> of sloppy application writers out there, and they've gotten lazy about
> using fsync() where it's necessary;

The problem is not having accidental data loss because the inode /happened/
to be written before the data, but having /guaranteed/ data loss in a
60-seconds-window. This is about as acceptable as having a filesystem
replace _any_ data with "deadbeef" on each crash unless fsync was called.

Besides that: If the problem is due to crappy VM writeout (is it?), reducing
security to DOS level is not the answer. You'd want your fs to be usable on
servers, wouldn't you?


^ permalink raw reply	[flat|nested] 664+ messages in thread

end of thread, other threads:[~2009-05-18 16:40 UTC | newest]

Thread overview: 664+ messages
2009-03-23 23:29 Linux 2.6.29 Linus Torvalds
2009-03-24  6:19 ` Jesper Krogh
2009-03-24  6:46   ` David Rees
2009-03-24  7:32     ` Jesper Krogh
2009-03-24  8:16       ` Ingo Molnar
2009-03-24 11:10         ` Jesper Krogh
2009-03-24 19:00       ` David Rees
2009-03-25 17:42         ` Jesper Krogh
2009-03-25 18:16           ` David Rees
2009-03-25 18:46             ` Jesper Krogh
2009-03-25 18:30         ` Theodore Tso
2009-03-25 18:40           ` Linus Torvalds
2009-03-25 22:05             ` Theodore Tso
2009-03-25 23:23               ` Linus Torvalds
2009-03-25 23:46                 ` Bron Gondwana
2009-03-26  0:32                   ` Ric Wheeler
2009-03-27  0:11                 ` Andrew Morton
2009-03-27  0:27                   ` Linus Torvalds
2009-03-27  0:47                     ` Andrew Morton
2009-03-27  1:03                       ` Linus Torvalds
2009-03-27  1:25                         ` Andrew Morton
2009-03-27  2:21                           ` David Rees
2009-03-27  3:03                             ` Matthew Garrett
2009-03-27  3:36                             ` Dave Jones
2009-03-27  3:01                           ` Matthew Garrett
2009-03-27  3:38                           ` Linus Torvalds
2009-03-27  3:59                             ` Linus Torvalds
2009-03-28 23:52                               ` david
2009-03-28  5:06                           ` Ingo Molnar
2009-04-01 21:03                           ` Lennart Sorensen
2009-04-01 21:36                             ` Andrew Morton
2009-04-01 22:57                               ` Lennart Sorensen
2009-04-03 14:46                                 ` Mark Lord
2009-04-03 15:16                                   ` Lennart Sorensen
2009-04-03 15:42                                     ` Mark Lord
2009-04-03 18:59                                     ` Jeff Garzik
2009-04-04  8:18                                       ` Andrew Morton
2009-04-04 12:40                                         ` Mark Lord
2009-04-05  1:57                                         ` David Newall
2009-04-05  3:46                                           ` Mark Lord
2009-04-02  1:00                               ` Ingo Molnar
2009-04-03  4:06                                 ` Lennart Sorensen
2009-04-03  4:13                                   ` Linus Torvalds
2009-04-03  7:25                                     ` Jens Axboe
2009-04-03  8:15                                       ` Ingo Molnar
2009-04-06 21:46                                         ` Bill Davidsen
2009-04-03 14:21                                       ` Lennart Sorensen
2009-04-03 15:05                                         ` Mark Lord
2009-04-03 15:14                                           ` Lennart Sorensen
2009-04-03 19:57                                           ` Jeff Garzik
2009-04-03 21:28                                             ` Janne Grunau
2009-04-03 21:57                                               ` Jeff Garzik
2009-04-03 22:32                                                 ` Janne Grunau
2009-04-03 22:57                                                   ` David Rees
2009-04-03 23:29                                                   ` Jeff Garzik
2009-04-03 23:52                                                     ` Linus Torvalds
2009-04-03 22:53                                                 ` David Rees
2009-04-03 23:30                                                   ` Jeff Garzik
2009-04-04 16:29                                                     ` Janne Grunau
2009-04-04 23:02                                                       ` Jeff Garzik
2009-04-05 14:20                                                         ` Janne Grunau
2009-04-06 14:06                                                         ` Lennart Sorensen
2009-04-03 22:28                                   ` Jeff Moyer
2009-04-06 14:15                                     ` Lennart Sorensen
2009-04-06 21:27                                       ` Mark Lord
2009-04-06 21:56                                         ` Lennart Sorensen
2009-04-02 11:05                             ` Janne Grunau
2009-04-02 16:09                               ` Andrew Morton
2009-04-02 16:33                                 ` David Rees
2009-04-02 16:46                                   ` Linus Torvalds
2009-04-02 16:51                                   ` Andrew Morton
2009-04-02 22:13                                     ` Jeff Garzik
2009-04-02 21:56                                   ` Jeff Garzik
2009-04-02 16:29                               ` David Rees
2009-04-02 16:42                                 ` Andrew Morton
2009-04-02 16:57                                   ` Linus Torvalds
2009-04-02 17:04                                     ` Linus Torvalds
2009-04-02 22:09                                       ` Jeff Garzik
2009-04-02 22:42                                         ` Linus Torvalds
2009-04-02 22:51                                           ` Andrew Morton
2009-04-02 23:00                                             ` Linus Torvalds
2009-04-03  2:01                                           ` Jeff Garzik
2009-04-03  2:16                                             ` Linus Torvalds
2009-04-03  3:05                                               ` Jeff Garzik
2009-04-03  3:34                                                 ` Linus Torvalds
2009-04-03 11:32                                                   ` Chris Mason
2009-04-03 15:07                                                     ` Linus Torvalds
2009-04-03 15:40                                                       ` Chris Mason
2009-04-03 20:05                                                       ` Jeff Garzik
2009-04-03 20:14                                                         ` Linus Torvalds
2009-04-03 21:48                                                           ` Jeff Garzik
2009-04-03 22:06                                                             ` Linus Torvalds
2009-04-03 23:48                                                               ` Jeff Garzik
2009-04-04 12:46                                                                 ` Mark Lord
2009-04-04 12:52                                                                   ` Huang Yuntao
2009-04-03 23:35                                                           ` Dave Jones
2009-04-04 12:44                                                         ` Mark Lord
2009-04-04 21:10                                                           ` Jeff Garzik
2009-04-03  5:05                                               ` Nick Piggin
2009-04-03  8:31                                               ` Jeff Garzik
2009-04-03  8:35                                                 ` Jeff Garzik
2009-04-03  2:38                                             ` Trenton D. Adams
2009-04-03  2:54                                               ` Jeff Garzik
2009-04-03 15:14                                       ` Mark Lord
2009-04-03 15:18                                         ` Lennart Sorensen
2009-04-03 15:46                                           ` Mark Lord
2009-04-03 15:28                                         ` Linus Torvalds
2009-04-02 18:52                                   ` David Rees
2009-04-02 21:42                                 ` Theodore Tso
2009-04-02 21:50                               ` Lennart Sorensen
2009-04-03 15:07                               ` Mark Lord
2009-04-02 12:17                             ` Theodore Tso
2009-04-02 21:54                               ` Lennart Sorensen
2009-04-02 23:27                                 ` Theodore Tso
2009-04-03  0:32                                   ` Lennart Sorensen
2009-03-27  3:23                         ` Theodore Tso
2009-03-27  3:47                           ` Matthew Garrett
2009-03-27  5:13                             ` Theodore Tso
2009-03-27  5:57                               ` Matthew Garrett
2009-03-27  6:21                                 ` Matthew Garrett
2009-03-27 11:24                                   ` Theodore Tso
2009-03-27 12:17                                     ` Linux 2.6.29 - delayed metadata for delayed allocation? Andreas T.Auer
2009-03-27 14:51                                     ` Linux 2.6.29 Matthew Garrett
2009-03-27 15:08                                       ` Alan Cox
2009-03-27 15:22                                         ` Matthew Garrett
2009-03-27 16:15                                           ` Alan Cox
2009-03-27 16:28                                             ` Matthew Garrett
2009-03-27 16:51                                               ` Alan Cox
2009-03-27 17:02                                                 ` Matthew Garrett
2009-03-27 17:19                                                   ` Alan Cox
2009-03-27 18:05                                                     ` Linus Torvalds
2009-03-27 18:35                                                       ` Alan Cox
2009-03-27 19:03                                                       ` Theodore Tso
2009-03-27 19:14                                                         ` Alan Cox
2009-03-27 19:32                                                           ` Theodore Tso
2009-03-27 20:11                                                             ` Andreas T.Auer
2009-03-27 22:01                                                               ` Linus Torvalds
2009-03-31  9:58                                                             ` Neil Brown
2009-03-27 19:19                                                         ` Gene Heskett
2009-03-27 19:48                                                           ` Theodore Tso
2009-03-27 20:02                                                             ` Aaron Cohen
     [not found]                                                             ` <727e50150903271301l36cff340l33e813bf6f77b4b@mail.gmail.com>
2009-03-27 20:04                                                               ` Theodore Tso
2009-03-27 22:37                                                             ` Gene Heskett
2009-03-27 22:55                                                               ` Theodore Tso
2009-03-28  0:42                                                                 ` Gene Heskett
2009-03-27 18:36                                                     ` Hua Zhong
2009-03-27 17:57                                                   ` Linus Torvalds
2009-03-27 18:22                                                     ` Linus Torvalds
2009-03-27 18:32                                                     ` Alan Cox
2009-03-27 18:40                                                       ` Linus Torvalds
2009-03-27 19:00                                                         ` Alan Cox
2009-03-29  9:15                                                           ` Xavier Bestel
2009-03-29 20:16                                                             ` Alan Cox
2009-03-29 21:07                                                               ` Linus Torvalds
2009-03-30 19:37                                                                 ` Jeremy Fitzhardinge
2009-03-27 20:27                                                         ` Felipe Contreras
2009-03-27 19:43                                                     ` Jeff Garzik
2009-03-27 20:01                                                       ` Theodore Tso
2009-03-27 22:20                                                         ` Jeff Garzik
2009-03-27 21:46                                                       ` Linus Torvalds
2009-03-27 22:06                                                         ` Jeff Garzik
2009-03-27 22:19                                                           ` Linus Torvalds
2009-03-27 22:25                                                             ` Linus Torvalds
2009-03-28  1:19                                                               ` Jeff Garzik
2009-03-28  1:30                                                                 ` David Miller
2009-03-28  2:19                                                                 ` Mark Lord
2009-03-28  2:49                                                                   ` Jeff Garzik
2009-03-28 13:29                                                                     ` Stefan Richter
2009-03-28 14:17                                                                       ` Jeff Garzik
2009-03-28 14:35                                                                         ` Stefan Richter
2009-03-28 15:17                                                                           ` Mark Lord
2009-03-28 16:08                                                                             ` Stefan Richter
2009-03-28 16:32                                                                               ` Linus Torvalds
2009-03-28 17:22                                                                                 ` David Hagood
2009-03-29  1:18                                                                             ` Jeff Garzik
2009-03-31 18:45                                                                               ` Jörn Engel
2009-03-29 23:14                                                                             ` Dave Chinner
2009-03-30  0:39                                                                               ` Theodore Tso
2009-03-30  1:29                                                                                 ` Trenton Adams
2009-03-30  3:28                                                                                   ` Theodore Tso
2009-03-30  3:55                                                                                     ` Trenton D. Adams
2009-03-30 13:45                                                                                       ` Theodore Tso
2009-03-30  6:31                                                                                 ` Dave Chinner
2009-03-30 13:55                                                                                   ` Theodore Tso
2009-03-30  7:13                                                                                 ` Andreas T.Auer
2009-03-30  9:05                                                                                   ` Alan Cox
2009-03-30 10:49                                                                                     ` Andreas T.Auer
2009-03-30 11:12                                                                                       ` Alan Cox
2009-03-30 11:17                                                                                       ` Ric Wheeler
2009-03-30 13:48                                                                                         ` Mark Lord
2009-03-30 14:00                                                                                           ` Ric Wheeler
2009-03-30 14:44                                                                                             ` Mark Lord
2009-03-30 14:58                                                                                               ` Ric Wheeler
2009-03-30 15:21                                                                                                 ` Mark Lord
2009-03-30 15:27                                                                                                   ` Ric Wheeler
2009-03-30 16:13                                                                                                     ` Linus Torvalds
2009-03-30 16:30                                                                                                       ` Mark Lord
2009-03-30 16:58                                                                                                         ` Linus Torvalds
2009-03-30 17:29                                                                                                           ` Mark Lord
2009-03-30 17:57                                                                                                           ` Chris Mason
2009-03-30 18:39                                                                                                             ` Mark Lord
2009-03-30 18:52                                                                                                               ` Chris Mason
2009-03-30 20:19                                                                                                                 ` Mark Lord
2009-03-30 18:54                                                                                                             ` Pasi Kärkkäinen
2009-03-30 15:00                                                                                               ` Jeff Garzik
2009-03-30 15:34                                                                                         ` Linus Torvalds
2009-03-30 16:11                                                                                           ` Ric Wheeler
2009-03-30 16:34                                                                                             ` Linus Torvalds
2009-03-30 17:11                                                                                               ` Ric Wheeler
2009-03-30 17:39                                                                                                 ` Mark Lord
2009-03-30 17:51                                                                                                 ` Linus Torvalds
2009-03-30 18:15                                                                                                   ` Ric Wheeler
2009-03-30 19:08                                                                                                   ` Eric Sandeen
2009-03-30 19:22                                                                                                   ` Rik van Riel
2009-03-30 19:41                                                                                                     ` Jeff Garzik
2009-03-30 20:21                                                                                                       ` Michael Tokarev
2009-03-30 20:26                                                                                                         ` Mark Lord
2009-03-30 20:29                                                                                                           ` Mark Lord
2009-03-30 20:35                                                                                                           ` Jeff Garzik
2009-03-30 20:40                                                                                                             ` Mark Lord
2009-03-30 20:34                                                                                                         ` Jeff Garzik
2009-03-30 20:05                                                                                                     ` Linus Torvalds
2009-03-31  9:27                                                                                                     ` Neil Brown
2009-03-31 21:13                                                                                                     ` Alan Cox
2009-03-31 21:10                                                                                               ` Alan Cox
2009-03-31 21:55                                                                                                 ` Linus Torvalds
2009-03-30 19:02                                                                                   ` Bill Davidsen
2009-04-01  1:19                                                                                     ` david
2009-04-01 16:24                                                                                       ` Bill Davidsen
2009-04-01 20:15                                                                                         ` david
2009-04-01 21:33                                                                                           ` Andreas T.Auer
2009-04-01 22:29                                                                                             ` david
2009-04-02  2:30                                                                                               ` Bron Gondwana
2009-04-02  4:55                                                                                                 ` david
2009-04-02  5:29                                                                                                   ` Bron Gondwana
2009-04-02  9:58                                                                                                   ` Andreas T.Auer
2009-04-02 12:30                                                                                               ` Bill Davidsen
2009-04-01 22:00                                                                                           ` Harald Arnesen
2009-04-01 22:09                                                                                             ` Alejandro Riveira Fernández
2009-04-01 22:28                                                                                             ` david
2009-03-30  3:01                                                                               ` Mark Lord
2009-03-30  6:41                                                                                 ` Andreas T.Auer
2009-03-30 12:55                                                                               ` Chris Mason
2009-03-30 17:42                                                                                 ` Theodore Tso
2009-03-31 23:55                                                                                 ` Dave Chinner
2009-04-01 12:53                                                                                   ` Chris Mason
2009-04-01 15:41                                                                                     ` Andreas T.Auer
2009-04-01 16:02                                                                                       ` Chris Mason
2009-04-01 18:37                                                                                         ` Andreas T.Auer
2009-04-01 21:50                                                                                         ` Theodore Tso
2009-04-01 23:44                                                                                           ` Matthew Garrett
2009-03-28 16:25                                                                           ` Alex Goebel
2009-03-28 21:12                                                                             ` Hua Zhong
2009-03-29  8:22                                                                               ` Stefan Richter
2009-03-29  0:33                                                                 ` david
2009-03-29  1:24                                                                   ` Jeff Garzik
2009-03-29  3:43                                                                     ` Theodore Tso
2009-03-29  4:53                                                                       ` Jeff Garzik
2009-03-31 15:01                                                                 ` Thierry Vignaud
2009-03-28  0:18                                                             ` Jeff Garzik
2009-03-28  1:45                                                               ` Linus Torvalds
2009-03-28  2:53                                                                 ` Jeff Garzik
2009-03-28  2:56                                                                   ` Zid Null
2009-03-28  3:55                                                                   ` Gene Heskett
2009-03-28 11:29                                                                     ` Alejandro Riveira Fernández
2009-03-28  2:16                                                             ` Mark Lord
2009-03-28  2:38                                                               ` Linus Torvalds
2009-03-28 11:57                                                                 ` Andreas T.Auer
2009-03-27 15:20                                       ` Giacomo A. Catenazzi
2009-03-27 21:11                                     ` Jeremy Fitzhardinge
2009-03-28  7:45                                       ` Bojan Smojver
2009-03-28  8:43                                         ` Bojan Smojver
2009-03-28 21:55                                           ` Bojan Smojver
2009-03-31 21:51                                             ` Jeremy Fitzhardinge
2009-03-31 22:30                                               ` Bojan Smojver
2009-04-01  5:26                                                 ` Bojan Smojver
2009-04-01  6:35                                                   ` Jeremy Fitzhardinge
2009-04-03 12:39                               ` Pavel Machek
2009-03-27  0:51                     ` Linus Torvalds
2009-03-27  1:03                       ` Andrew Morton
2009-03-27  9:58                   ` Alan Cox
2009-03-26  2:50               ` Neil Brown
2009-03-26  3:13                 ` Theodore Tso
2009-03-24  9:15     ` Alan Cox
2009-03-24  9:32       ` Ingo Molnar
2009-03-24 10:10         ` Alan Cox
2009-03-24 10:31           ` Ingo Molnar
2009-03-24 11:12             ` Andrew Morton
2009-03-24 12:23               ` Alan Cox
2009-03-24 13:37               ` Theodore Tso
2009-03-25 12:37               ` Jan Kara
2009-03-25 15:00                 ` Theodore Tso
2009-03-25 17:29                   ` Linus Torvalds
2009-03-25 17:57                     ` Alan Cox
2009-03-25 18:09                     ` David Rees
2009-03-25 18:21                       ` Linus Torvalds
2009-03-25 18:26                         ` Linus Torvalds
2009-03-25 18:48                           ` Ric Wheeler
2009-03-25 18:49                           ` Alan Cox
2009-03-25 18:55                             ` Ric Wheeler
2009-03-25 18:58                     ` Theodore Tso
2009-03-25 19:48                       ` Christoph Hellwig
2009-03-25 21:50                         ` Theodore Tso
2009-03-26  2:10                           ` Matthew Garrett
2009-03-26  2:36                             ` Jeff Garzik
2009-03-26  2:42                               ` Matthew Garrett
     [not found]                             ` <f73f7ab80903251944s581166bbk31c26db50750814a@mail.gmail.com>
2009-03-26  2:46                               ` Kyle Moffett
2009-03-26  2:51                                 ` Jeff Garzik
2009-03-26  3:03                                   ` Kyle Moffett
2009-03-26  3:40                                     ` Linus Torvalds
2009-03-26  3:57                                       ` David Miller
2009-03-26  4:58                                       ` Kyle Moffett
2009-03-26  6:24                                         ` Jeff Garzik
2009-03-26 12:49                                           ` Kyle Moffett
2009-03-26  2:47                               ` Matthew Garrett
2009-03-26  2:54                                 ` Kyle Moffett
2009-03-25 20:45                       ` Linus Torvalds
2009-03-25 21:51                         ` Theodore Tso
2009-03-25 23:21                           ` Linus Torvalds
2009-03-25 23:50                             ` Jan Kara
2009-03-26  0:04                               ` Linus Torvalds
2009-03-26  9:06                               ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
2009-03-26  9:09                                 ` Ingo Molnar
2009-03-26 11:08                                 ` Jens Axboe
2009-03-26 14:27                                   ` Arjan van de Ven
2009-03-26 14:36                                     ` Jens Axboe
2009-03-26 14:49                                       ` Arjan van de Ven
2009-03-26 11:37                                 ` Theodore Tso
2009-03-26 12:44                                   ` Ingo Molnar
2009-03-26 12:46                                   ` Ingo Molnar
2009-03-26 14:03                                   ` Ingo Molnar
2009-03-26 14:13                                     ` Ingo Molnar
2009-03-26 14:30                                     ` Andrew Morton
2009-03-26 15:32                                       ` relatime: update once per day patches (was: ext3 IO latency measurements) Frans Pop
2009-03-26 15:47                                         ` Andrew Morton
2009-03-26 16:14                                           ` Linus Torvalds
2009-03-26 16:24                                             ` Andrew Morton
2009-03-26 17:12                                               ` Frans Pop
2009-03-26 17:48                                                 ` Andrew Morton
2009-03-26 18:49                                                   ` Matthew Garrett
2009-03-26 19:20                                                     ` Andrew Morton
2009-03-26 19:43                                                       ` Matthew Garrett
2009-03-27 11:25                                                       ` David Hagood
2009-03-26 16:30                                             ` Theodore Tso
2009-03-26 16:40                                               ` Jose Celestino
2009-03-26 17:14                                                 ` Frans Pop
2009-03-26 16:53                                               ` Frans Pop
2009-03-26 16:53                                               ` Linus Torvalds
2009-03-26 17:32                                             ` [PATCH] Allow relatime to update atime once a day Matthew Garrett
2009-03-26 17:56                                               ` Alexey Dobriyan
2009-03-26 18:55                                               ` Alan Cox
2009-03-26 14:47                                     ` ext3 IO latency measurements (was: Linux 2.6.29) Theodore Tso
2009-03-26 16:20                                       ` Linus Torvalds
2009-03-26 17:06                                         ` Andreas Schwab
2009-03-26 17:07                                         ` Theodore Tso
2009-03-26 17:16                                           ` Linus Torvalds
2009-03-26 17:49                                             ` [PATCH 1/2] Add a strictatime mount option Matthew Garrett
2009-03-26 17:53                                               ` [PATCH 2/2] Make relatime default Matthew Garrett
2009-03-26 18:48                                                 ` Alan Cox
2009-03-26 22:27                                                   ` Linus Torvalds
2009-03-27  0:15                                                     ` Frans Pop
2009-03-27  0:30                                                       ` Linus Torvalds
2009-03-27 14:06                                                         ` Alan Cox
2009-03-27  2:05                                                     ` Frans Pop
2009-04-09 20:13                                                     ` Pavel Machek
2009-04-09 20:47                                                       ` Linus Torvalds
2009-04-09 21:15                                                         ` Pavel Machek
2009-04-09 21:20                                                           ` Linus Torvalds
2009-04-09 22:00                                                             ` Pavel Machek
2009-03-30 14:42                                                   ` Andrea Arcangeli
2009-03-30 14:52                                                     ` Xavier Bestel
2009-03-30 19:26                                                     ` Bill Davidsen
2009-03-26 18:52                                               ` [PATCH 1/2] Add a strictatime mount option Alan Cox
2009-03-26 18:59                                             ` ext3 IO latency measurements (was: Linux 2.6.29) Alan Cox
2009-03-26 20:02                                               ` Matthew Garrett
2009-03-26 20:42                                                 ` Alan Cox
2009-03-26 20:55                                                   ` Matthew Garrett
2009-03-26 20:58                                                     ` Alan Cox
2009-03-26 23:04                                                   ` Bron Gondwana
2009-03-27 11:22                                                     ` Alan Cox
2009-03-27 12:19                                                       ` Bron Gondwana
2009-03-27 13:56                                                         ` Alan Cox
2009-03-27 12:00                                               ` ext3 IO latency measurements Giacomo A. Catenazzi
2009-03-26 17:29                                         ` ext3 IO latency measurements (was: Linux 2.6.29) Frans Pop
2009-03-26 17:32                                         ` Bill Nottingham
2009-03-26 17:41                                           ` Linus Torvalds
2009-03-26 18:23                                             ` Bill Nottingham
2009-03-26 22:24                                               ` Linus Torvalds
2009-03-27 13:47                                                 ` Bill Nottingham
2009-03-26 18:54                                           ` Alan Cox
2009-03-26 15:28                                     ` Theodore Tso
2009-03-26 23:02                                       ` Ingo Molnar
2009-03-26 23:59                                         ` Theodore Tso
2009-03-27  0:08                                           ` Ingo Molnar
2009-03-27  0:40                                           ` Jesse Barnes
2009-03-31 11:51                                   ` Neil Brown
2009-03-26 12:22                                 ` Pekka Enberg
2009-03-26 12:23                                   ` Pekka Enberg
2009-03-26 14:38                                 ` Andrew Morton
2009-03-26 18:11                                 ` Jan Kara
2009-03-26 18:35                                   ` Andrew Morton
2009-03-27 21:26                                     ` Jan Kara
2009-03-26 22:39                                   ` Linus Torvalds
2009-03-26 22:57                                     ` Andrew Morton
2009-03-27 21:38                                       ` Jan Kara
2009-03-27 22:10                                         ` Linus Torvalds
2009-03-28 19:43                                           ` Andrew Morton
2009-04-09 21:59                                 ` updated: ext3 IO latency measurements on v2.6.30-rc1 Ingo Molnar
2009-04-10  7:34                                   ` Heinz Diehl
2009-05-18 16:37                                   ` Sanjoy Mahajan
2009-03-25 23:57                             ` Linux 2.6.29 Linus Torvalds
2009-03-26  0:22                               ` Jan Kara
2009-03-26  1:34                                 ` Linus Torvalds
2009-03-26  2:59                                   ` Theodore Tso
2009-03-26 16:24                                   ` Jan Kara
2009-03-24 13:20             ` Theodore Tso
2009-03-24 13:30               ` Ingo Molnar
2009-03-24 13:51                 ` Theodore Tso
2009-03-24 16:34                   ` Jesper Krogh
2009-03-24 17:32                     ` Linus Torvalds
2009-03-24 18:20                   ` Mark Lord
2009-03-24 18:41                     ` Eric Sandeen
2009-03-24 13:52               ` Alan Cox
2009-03-24 14:28                 ` Theodore Tso
2009-03-24 15:18                   ` Alan Cox
2009-03-24 17:55                 ` Jan Kara
2009-03-24 17:55               ` Linus Torvalds
2009-03-24 18:41                 ` Kyle Moffett
2009-03-24 19:17                   ` Linus Torvalds
2009-03-24 18:45                 ` Theodore Tso
2009-03-24 19:21                   ` Linus Torvalds
2009-03-24 19:40                     ` Ric Wheeler
2009-03-24 19:55                     ` Jeff Garzik
2009-03-25  9:34                       ` Benny Halevy
2009-03-25  9:39                       ` Jens Axboe
2009-03-25 19:32                         ` Jeff Garzik
2009-03-25 19:43                           ` Christoph Hellwig
2009-03-25 19:43                           ` Jens Axboe
2009-03-25 19:49                             ` Ric Wheeler
2009-03-25 19:57                               ` Jens Axboe
2009-03-25 20:41                                 ` Hugh Dickins
2009-03-26  8:57                                   ` Jens Axboe
2009-03-26 14:47                                     ` Hugh Dickins
2009-03-26 15:46                                       ` Jens Axboe
2009-03-26 18:21                                         ` Hugh Dickins
2009-03-26 18:32                                           ` Jens Axboe
2009-03-26 19:00                                             ` Hugh Dickins
2009-03-26 19:03                                               ` Jens Axboe
2009-03-25 20:16                               ` Jeff Garzik
2009-03-25 20:25                                 ` Ric Wheeler
2009-03-25 21:22                                   ` James Bottomley
2009-03-26  8:59                                     ` Jens Axboe
2009-03-30 19:05                                     ` range-based cache flushing (was Re: Linux 2.6.29) Jeff Garzik
2009-04-01  0:14                                       ` James Bottomley
2009-04-01  1:28                                         ` Jeff Garzik
2009-04-01 21:20                                           ` James Bottomley
2009-03-25 21:27                                 ` Linux 2.6.29 Benny Halevy
2009-03-25 20:25                             ` Jeff Garzik
2009-03-25 20:40                               ` Linus Torvalds
2009-03-25 20:57                                 ` Ric Wheeler
2009-03-25 23:02                                   ` Linus Torvalds
2009-03-26  0:28                                     ` Ric Wheeler
2009-03-26  1:36                                       ` Linus Torvalds
2009-03-25 21:29                                 ` [PATCH] issue storage device flush via sync_blockdev() (was Re: Linux 2.6.29) Jeff Garzik
2009-03-25 21:56                                   ` Eric Sandeen
2009-03-25 23:08                                     ` Jeff Garzik
2009-03-26  2:31                                       ` Eric Sandeen
2009-03-26 14:19                                         ` Ric Wheeler
2009-03-26  0:58                                     ` Ric Wheeler
2009-03-26  1:26                                       ` Jeff Garzik
2009-03-26  1:33                                         ` Jeff Garzik
2009-03-26  1:39                                           ` Ric Wheeler
2009-03-26  8:24                                         ` Christoph Hellwig
2009-03-27  7:59                                     ` Jens Axboe
2009-03-25 22:01                                   ` Alan Cox
2009-03-25 23:12                                     ` Jeff Garzik
2009-03-26  3:24                                   ` [PATCH v2] issue storage device flush via sync_blockdev() Jeff Garzik
2009-03-27  2:50                                     ` Theodore Tso
2009-03-27  3:17                                       ` Jeff Garzik
2009-03-27  3:30                                         ` Theodore Tso
2009-03-27 20:50                                     ` [PATCH] issue storage dev flush from generic file_fsync helper Jeff Garzik
2009-03-29  8:25                                       ` Christoph Hellwig
2009-03-30  1:25                                         ` Fernando Luis Vázquez Cao
2009-03-30  1:36                                           ` [PATCH 1/5] block: Add block_flush_device() Fernando Luis Vázquez Cao
2009-03-30  1:40                                           ` [PATCH 2/5] ext3: call blkdev_issue_flush on fsync() Fernando Luis Vázquez Cao
2009-03-30  1:51                                             ` Jeff Garzik
2009-03-30  2:50                                               ` Fernando Luis Vázquez Cao
2009-03-30 12:04                                                 ` Fernando Luis Vázquez Cao
2009-03-30 12:09                                                   ` [PATCH 1/7] block: Add block_flush_device() Fernando Luis Vázquez Cao
2009-03-30 15:07                                                     ` Bartlomiej Zolnierkiewicz
2009-03-31  6:09                                                       ` Fernando Luis Vázquez Cao
2009-03-30 17:34                                                     ` Linus Torvalds
2009-03-30 17:50                                                       ` Jeff Garzik
2009-03-30 17:55                                                       ` Jens Axboe
2009-03-30 18:27                                                         ` Linus Torvalds
2009-03-30 18:54                                                           ` Jens Axboe
2009-03-30 19:16                                                             ` Jeff Garzik
2009-03-30 19:24                                                               ` Chris Mason
2009-03-30 20:09                                                                 ` Andi Kleen
2009-03-30 20:15                                                                   ` Chris Mason
2009-03-30 19:59                                                               ` Linus Torvalds
2009-03-30 20:31                                                                 ` Jeff Garzik
2009-03-30 19:45                                                             ` Linus Torvalds
2009-03-30 20:17                                                               ` Jens Axboe
2009-03-30 20:36                                                                 ` Linus Torvalds
2009-03-31  2:14                                                                   ` Ric Wheeler
2009-03-31  2:47                                                                     ` Linus Torvalds
2009-03-31  6:04                                                                       ` Jens Axboe
2009-03-31 11:15                                                                       ` Ric Wheeler
2009-03-31 14:55                                                                         ` Linus Torvalds
2009-03-31 15:22                                                                           ` Chris Mason
2009-03-31 15:41                                                                           ` Ric Wheeler
2009-03-31 16:15                                                                             ` Linus Torvalds
2009-03-31 16:43                                                                               ` Jens Axboe
2009-03-31 16:57                                                                                 ` Linus Torvalds
2009-03-31 17:19                                                                                   ` Jens Axboe
2009-04-01  0:54                                                                                     ` Tejun Heo
2009-03-31 17:03                                                                                 ` Jens Axboe
2009-04-01  0:43                                                                                   ` Tejun Heo
2009-03-31 17:14                                                                               ` Ric Wheeler
2009-03-31 19:25                                                                             ` Mark Lord
2009-03-31 15:54                                                                           ` Linus Torvalds
2009-03-31 16:29                                                                             ` Alan Cox
2009-03-31  6:01                                                                   ` Jens Axboe
2009-03-30 20:52                                                                 ` Mark Lord
2009-03-30 20:57                                                                   ` Jeff Garzik
2009-03-31 13:16                                                                   ` Chris Mason
2009-03-31 13:23                                                                     ` Mark Lord
2009-03-31 13:28                                                                       ` Chris Mason
2009-03-31 15:49                                                                   ` Eric Sandeen
2009-03-31 16:37                                                                     ` Mark Lord
2009-03-30 12:11                                                   ` [PATCH 2/7] ext3: call blkdev_issue_flush() on fsync() Fernando Luis Vázquez Cao
2009-03-30 14:04                                                     ` Theodore Tso
2009-03-30 14:15                                                       ` Chris Mason
2009-03-30 14:33                                                         ` Theodore Tso
2009-03-31  1:26                                                           ` Tejun Heo
2009-03-31  1:58                                                             ` Theodore Tso
2009-03-31  2:14                                                               ` Tejun Heo
2009-03-31 11:18                                                             ` Jens Axboe
2009-03-31 21:29                                                               ` Jeff Garzik
2009-04-01  1:03                                                                 ` Tejun Heo
2009-03-30 12:15                                                   ` [PATCH 3/7] ext4: " Fernando Luis Vázquez Cao
2009-03-30 12:18                                                   ` [PATCH 4/7] vfs: call blkdev_issue_flush() from generic file_fsync() helper Fernando Luis Vázquez Cao
2009-03-30 12:22                                                   ` [PATCH 5/7] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes Fernando Luis Vázquez Cao
2009-03-30 12:36                                                     ` Jens Axboe
2009-03-30 14:18                                                       ` Fernando Luis Vázquez Cao
2009-03-30 14:35                                                         ` Jens Axboe
2009-03-31  6:49                                                           ` Fernando Luis Vázquez Cao
2009-03-31 10:38                                                             ` Jens Axboe
2009-03-31 11:56                                                               ` Fernando Luis Vázquez Cao
2009-03-30 15:14                                                     ` Bartlomiej Zolnierkiewicz
2009-03-30 17:51                                                       ` Jens Axboe
2009-03-30 17:55                                                         ` Jeff Garzik
2009-03-30 17:59                                                           ` Jens Axboe
2009-03-30 19:09                                                             ` Jeff Garzik
2009-03-30 20:56                                                               ` Bartlomiej Zolnierkiewicz
2009-03-30 22:01                                                                 ` Jeff Garzik
2009-03-30 12:33                                                   ` [PATCH 6/7] xfs: propagate issue-flush error code Fernando Luis Vázquez Cao
2009-03-30 15:20                                                     ` Bartlomiej Zolnierkiewicz
2009-03-31 23:37                                                     ` Dave Chinner
2009-04-01  3:52                                                       ` Fernando Luis Vázquez Cao
2009-03-30 12:36                                                   ` [PATCH 7/7] reiserfs: " Fernando Luis Vázquez Cao
2009-03-30 15:25                                                     ` Bartlomiej Zolnierkiewicz
2009-03-30  1:43                                           ` [PATCH 3/5] ext4: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
2009-03-30  1:53                                           ` [PATCH 4/5] vfs: call blkdev_issue_flush from generic file_fsync helper Fernando Luis Vázquez Cao
2009-03-30  1:59                                           ` [PATCH 5/5] vfs: Add wbcflush sysfs knob to disable storage device writeback cache flushes Fernando Luis Vázquez Cao
2009-03-25 21:33                                 ` Linux 2.6.29 Jeff Garzik
2009-03-27  7:57                                 ` Jens Axboe
2009-03-27 14:13                                   ` Theodore Tso
2009-03-27 14:35                                     ` Christoph Hellwig
2009-03-27 15:03                                       ` Ric Wheeler
2009-03-27 20:38                                       ` Jeff Garzik
2009-03-28  0:14                                         ` Alan Cox
2009-03-29  8:25                                         ` Christoph Hellwig
2009-03-27 19:14                                   ` Chris Mason
2009-03-27  7:46                               ` Jens Axboe
2009-03-31 20:49                             ` Jeff Garzik
2009-03-31 22:02                               ` Ric Wheeler
2009-03-31 22:22                                 ` Jeff Garzik
2009-04-01 18:34                                   ` Mark Lord
2009-03-24 20:24               ` David Rees
2009-03-25  7:30                 ` David Rees
2009-03-24 23:03               ` Jesse Barnes
2009-03-25  0:05                 ` Arjan van de Ven
2009-03-25 17:59                   ` David Rees
2009-03-25 18:40                   ` Stephen Clark
2009-03-26 23:53                     ` Mark Lord
2009-03-25  2:09                 ` Theodore Tso
2009-03-25  3:57                   ` Jesse Barnes
2009-03-27 11:27                 ` Martin Steigerwald
2009-03-24 12:27           ` Andi Kleen
2009-04-02 14:00   ` Mathieu Desnoyers
2009-03-24 13:02 ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Ingo Molnar
2009-03-24 13:12   ` Ingo Molnar
2009-03-24 13:12     ` Ingo Molnar
2009-03-24 13:35   ` Herbert Xu
2009-03-24 14:06     ` Ingo Molnar
2009-03-24 14:33   ` Robert Schwebel
2009-03-24 14:39     ` Ingo Molnar
2009-03-24 15:09       ` Herbert Xu
2009-03-24 15:29         ` Sascha Hauer
2009-03-24 15:36         ` Ingo Molnar
2009-03-24 15:47           ` Ingo Molnar
2009-03-24 15:59             ` Herbert Xu
2009-03-24 16:02               ` Ingo Molnar
2009-03-24 19:19                 ` Ingo Molnar
2009-03-24 20:54                   ` Ingo Molnar
2009-03-24 21:17                     ` Revert "gro: Fix legacy path napi_complete crash", David Miller
2009-03-24 22:01                       ` Ingo Molnar
2009-03-25  0:33                   ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Herbert Xu
2009-03-25  0:32                 ` Herbert Xu
2009-03-25  2:09                   ` Revert "gro: Fix legacy path napi_complete crash", David Miller
2009-03-24 21:36         ` David Miller
2009-03-24 22:47           ` David Miller
2009-03-25  0:24             ` Herbert Xu
2009-03-25  0:23           ` Herbert Xu
2009-03-25  2:11             ` David Miller
2009-03-25  7:33               ` Ingo Molnar
2009-03-25  8:04                 ` David Miller
2009-03-25 12:08                 ` Herbert Xu
2009-03-25 12:20                   ` Ingo Molnar
2009-03-25 12:26                     ` Herbert Xu
2009-03-25 22:01                       ` Ingo Molnar
2009-03-25 22:20                         ` Ken Witherow
2009-03-26  9:07                         ` Herbert Xu
2009-03-26  9:25                           ` Ingo Molnar
2009-03-25 22:54                       ` Jarek Poplawski
2009-03-26  0:03                         ` David Miller
2009-03-26  0:10                         ` David Miller
2009-03-26  6:43                           ` Jarek Poplawski
2009-03-26  7:52                             ` David Miller
2009-03-26  7:59                               ` Jarek Poplawski
2009-03-26  2:41                         ` Herbert Xu
2009-03-26  3:20                           ` David Miller
2009-03-26  3:40                             ` Herbert Xu
2009-03-26  9:18                             ` Jarek Poplawski
2009-03-26  7:39                           ` Jarek Poplawski
2009-03-26  7:59                   ` David Miller
2009-03-25  9:34             ` Ingo Molnar
2009-03-24 15:22       ` Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux 2.6.29) Sascha Hauer
2009-03-27 13:35 ` Linux 2.6.29 Hans-Peter Jansen
2009-03-27 14:53   ` Geert Uytterhoeven
2009-03-27 15:46     ` Mike Galbraith
2009-03-27 16:02       ` Linus Torvalds
2009-03-28  7:50         ` Mike Galbraith
2009-03-30 22:00         ` Hans-Peter Jansen
2009-03-30 22:07           ` Arjan van de Ven
2009-03-30 10:18             ` Pavel Machek
2009-03-31 13:33             ` Rafael J. Wysocki
2009-03-31 15:30             ` Hans-Peter Jansen
2009-03-31 19:37             ` Jeff Garzik
2009-03-31 19:47               ` Arjan van de Ven
2009-04-02 19:01           ` Andreas T.Auer
2009-03-27 16:49   ` Frans Pop
     [not found] <cj1Ut-1i2-7@gated-at.bofh.it>
     [not found] ` <cj8ji-35a-5@gated-at.bofh.it>
     [not found]   ` <cj8CC-3uO-5@gated-at.bofh.it>
     [not found]     ` <cjaXP-7vB-17@gated-at.bofh.it>
     [not found]       ` <cjbhc-7VP-13@gated-at.bofh.it>
     [not found]         ` <cjbTS-AR-11@gated-at.bofh.it>
     [not found]           ` <cjcdl-10E-31@gated-at.bofh.it>
     [not found]             ` <cjeS0-5nC-33@gated-at.bofh.it>
2009-03-25 15:19               ` Bodo Eggert
     [not found]     ` <cj9oW-4PO-1@gated-at.bofh.it>
     [not found]       ` <cjkaL-5xg-15@gated-at.bofh.it>
     [not found]         ` <cjGbn-6Xx-35@gated-at.bofh.it>
     [not found]           ` <cjGl4-7ak-61@gated-at.bofh.it>
     [not found]             ` <cjJsv-3Uu-5@gated-at.bofh.it>
     [not found]               ` <cjKHX-5MF-19@gated-at.bofh.it>
     [not found]                 ` <ck7XP-JA-13@gated-at.bofh.it>
     [not found]                   ` <ck8h7-18H-1@gated-at.bofh.it>
     [not found]                     ` <ck8AA-1y0-7@gated-at.bofh.it>
     [not found]                       ` <ck8Kd-20o-7@gated-at.bofh.it>
     [not found]                         ` <ckaVR-5o8-9@gated-at.bofh.it>
     [not found]                           ` <ckbf4-5M8-7@gated-at.bofh.it>
     [not found]                             ` <ckcE8-845-9@gated-at.bofh.it>
2009-03-27 21:53                               ` Bodo Eggert
2009-03-28  6:51                                 ` Mike Galbraith
2009-03-28 12:12                                 ` Theodore Tso
     [not found]                               ` <ckdgW-uA-1@gated-at.bofh.it>
     [not found]                                 ` <ckdK0-1om-1@gated-at.bofh.it>
     [not found]                                   ` <ckiqk-uZ-13@gated-at.bofh.it>
     [not found]                                     ` <cklHL-5IW-23@gated-at.bofh.it>
     [not found]                                       ` <ckm0Q-6o4-3@gated-at.bofh.it>
     [not found]                                         ` <ckmaG-6AC-5@gated-at.bofh.it>
     [not found]                                           ` <ckmWO-7Ro-5@gated-at.bofh.it>
     [not found]                                             ` <ckn6L-84h-19@gated-at.bofh.it>
     [not found]                                               ` <cknzG-ff-7@gated-at.bofh.it>
     [not found]                                                 ` <cknJk-IB-9@gated-at.bofh.it>
     [not found]                                                   ` <ckoFm-2eg-3@gated-at.bofh.it>
     [not found]                                                     ` <ckoYP-2DC-13@gated-at.bofh.it>
2009-03-28 11:53                                                       ` Bodo Eggert
2009-03-29 14:45                                                         ` Pavel Machek
2009-03-29 15:47                                                           ` Linus Torvalds
2009-03-29 19:15                                                             ` Pavel Machek
2009-03-30 14:22                                                           ` Morten P.D. Stevens
     [not found]                         ` <ck93x-2oJ-3@gated-at.bofh.it>
2009-03-27 23:22                           ` Bodo Eggert
