$Header: /home/swa/swa-CVS/public_html/Linux/Docs/vm-bdflush.txt,v 1.2 2003/12/19 18:52:30 swa Exp $

------------------------------------------------------------------------------
The /proc/sys/vm/bdflush sysctl interface.
------------------------------------------------------------------------------

Written by Steven Augart
May 19, 2002
Revised May 7 -- 9, 2003
Updated December 19, 2003, to include kernel 2.4.20.SuSE

CONTENTS:
    Introduction
    Credits
    Why I Wrote This
    Why we care about past kernel versions
    History
    Inaccurate (but widely circulated) documentation
    What
      - Table of /proc/sys/vm/bdflush parameters
      - nfract
      - ndirty
      - nrefill
      - nref_dirt
      - interval
      - age_buffer
      - age_super
      - nfract_sync
      - nfract_stop_bdflush
    End-notes

INTRODUCTION:

This file documents the sysctl file /proc/sys/vm/bdflush, which controls
the operation of the bdflush and kupdated kernel daemons.

CREDITS:

This incorporates text from Rik van Riel's excellent "Documentation for
/proc/sys/vm/*", which appears (in at least two versions) in the 2.2.x
and 2.4.x Linux kernel series as Documentation/sysctl/vm.txt.

WHY I WROTE THIS:

I wrote this document because I was trying to maximize my laptop's
battery life. (I frequently fly across the USA, from my home in Venice
Beach, California, to visit my beloved grandmother near Boston. This
flight takes over five hours eastbound and six hours westbound.) One way
to extend battery life is to keep the disk drive from spinning up. (See
http://www.augart.com/thinkpad600/hard-drive-spindown-timeout.html for
some calculations on this topic.)

WHY WE CARE ABOUT PAST KERNEL VERSIONS:

I started to write a laptop disk usage control panel, and found that the
existing documentation on the bdflush sysctl file was out-of-date. I
then learned that the kernel's interpretation of the sysctl file's
contents has changed with different kernel versions. This made me
realize that the laptop control panel was going to have to know about
different behavior for different kernel versions.
Otherwise, it might do the wrong thing as soon as I changed kernel
versions or distributed it to someone who used it on another kernel
version.

It is not adequate to provide information on what a single kernel
version does. Developers want to write utilities that other folks can
use. Users want to have some chance of their programs working when they
upgrade kernels. Linux distributors don't want to have to recompile
other packages when they update the kernel.

In writing this document, I wanted to make sure to do the Right Thing
(tm). This means having documentation that is useful to someone writing
programs that modify the bdflush parameters. As I discuss above, giving
some historical details of each bdflush parameter looks like the best
way of avoiding trouble.

HISTORY:

I examined Linux kernels 0.01, 0.12, 0.95, 0.99.15, 1.0, 1.2.0, 2.0.1,
2.0.30, 2.2.0, 2.2.10, 2.2.20, 2.4.0, 2.4.5, 2.4.10, 2.4.13, 2.4.16,
2.4.18, 2.4.19-pre8 (hereafter referred to as 2.4.19), 2.4.20.SuSE,
2.5.5, 2.5.11, and 2.5.15.

    Linux 0.01 didn't have sysctl files :)
    0.12 didn't have a bdflush daemon :)
    0.95 ditto,
    0.99.14 has the first proto-parameters.

Once upon a time (in the 1970s) there was just a system called Unix. And
by version 1, it had a system call named `sync'. `sync' would flush out
all the dirty buffers to disk. Until `sync' was called, Unix would just
cheerfully keep sticking pending disk writes into dirty buffers until it
had no more space. (1)

So a daemon (/sbin/update) was added to automatically call `sync' every
thirty seconds. That meant that you couldn't lose more than a bit over
thirty seconds' worth of data. This seemed reasonable.

Linus used this approach for the Linux kernel. Right from kernel version
0.01, there was a sync() system call. The idea was that you'd have an
update daemon (/sbin/update) to do the writes.

Eric Youngdale implemented some code to improve matters.
[ TODO: Add more history here ]

In December 1995, Paul Gortmaker removed the necessity for a syscall to
start bdflush(); he made it so that we can use a kernel thread instead.

In 1999, Andrea Arcangeli added asynchronous buffer flushing.

In the 2.4 kernel series, the bdflush code is primarily in fs/buffer.c.
By kernel version 2.4.20.SuSE, the bdflush parameters are defined in a
separate file, include/linux/bdf_prm.h.

INACCURATE (but widely circulated) DOCUMENTATION:

Documentation claiming to describe how the bdflush parameters worked in
2.2.10 is in Documentation/sysctl/vm.txt in the 2.4.x and 2.5.x kernel
source trees. It is mostly accurate, but some aspects of its description
of bdflush reflect features of the 2.4.x kernels, and the version number
in the source wasn't updated.

The other widely available document, Documentation/filesystems/proc.txt,
merged in Rik van Riel's version of "vm.txt" for kernel version 2.1.128.
"proc.txt", as shipped in the most recent kernels (2.4.19, 2.5.15),
states that it was updated to 2.4.0. This update appears to have left
Rik van Riel's 2.1.128 material untouched.

WHAT:

/proc/sys/vm/bdflush contains nine integer values. Different kernel
versions use different subsets of these values. Even more exciting is
that one of them (parameter 7, nfract_sync/age_super) changed meaning in
each of the 2.2, 2.4, and 2.5 kernel series.

They are listed in the Table in order, along with the kernel versions I
sampled in the 2.0.x, 2.2.x, and 2.4.x series in which they are or are
not used. None of them are used in 2.5.15; the 2.5.x kernel series is
going over to a new subsystem, called `pdflush'.

Table: Parameters in /proc/sys/vm/bdflush
..............................................................................
#  Value                Meaning

1  nfract               Percentage of buffer cache dirty to activate
                        bdflush
                        [ All versions ]

2  ndirty               Maximum number of dirty blocks to write out per
                        wake-cycle
                        [In 1.2.0, 2.0.1, 2.2.0, 2.2.10, 2.2.20, 2.4.0,
                        2.4.16, 2.4.20.SuSE]
2  dummy                [Unused in 2.4.18, 2.4.19]

3  nrefill              Number of clean buffers to try to obtain each
                        time we call refill
                        [In 1.2.0, 2.0.1, 2.2.0, 2.2.10, 2.2.20, 2.4.0]
3  dummy                [2.4.16, 2.4.19, 2.4.20, and the 2.4.16 vm.txt]

4  nref_dirt            Dirty buffer threshold for activating bdflush
                        when trying to refill buffers.
                        [In 2.0.1, 2.2.0, 2.2.10, 2.2.20]
4  dummy                [vm.txt, 2.4.0, 2.4.16, 2.4.19, 2.4.20]

5  interval             Jiffies delay between kupdated flushes.
                        [In 2.2.20, 2.4.0, 2.4.16, 2.4.18, 2.4.19,
                        2.4.20]
5  dummy                [Unused in 2.2.0, 2.2.10]
5  clu_nfract           [In 1.2.0 and 2.0.1, this was the percentage of
                        the buffer cache to scan to search for free
                        clusters.]

6  age_buffer           Time for normal buffer to age before we flush it

7  nfract_sync          Percentage of buffer cache dirty to activate
                        bdflush synchronously.
                        [In vm.txt, 2.4.0, 2.4.16, 2.4.18, 2.4.19,
                        2.4.20]
7  age_super            Time for superblock to age before we flush it.
                        [In 1.2.0, 2.0.1, 2.2.0, 2.2.10, 2.2.20, and
                        proc.txt]

8  dummy                [Unused in 2.2.0, 2.2.10, 2.2.20, 2.4.0, 2.4.18,
                        and 2.4.19]
8  nfract_stop_bdflush  Percentage of buffer cache dirty to stop bdflush.
                        [2.4.16 and 2.4.20.SuSE]
8  lav_const            Constant used for load average.
                        [1.2.0, 2.0.1]

9  dummy                Unused
9  lav_ratio            Used to determine how low a lav for a particular
                        size can go before we start to trim back the
                        buffers
                        [In 1.2.0, 2.0.1]
..............................................................................

nfract
------

This parameter is a percentage. It governs the maximum number of dirty
buffers in the buffer cache before the kernel will flush some of them.
Dirty means that the contents of the buffer still have to be written to
disk (as opposed to a clean buffer, which can just be forgotten about).
Setting this to a higher value means that Linux can delay disk writes
for a long time, but it also means that it will have to do a lot of I/O
at once when memory becomes short. A lower value will spread out disk
I/O more evenly, at the cost of more frequent I/O operations.

In the 2.2.x kernel series, this I/O is synchronous. In the 2.4.x
series, this is queued asynchronous I/O (more on that below) and there
is another (presumably larger) parameter, nfract_sync, that governs the
percentage threshold for synchronous I/O.

2.4.x kernels distinguish between synchronous and asynchronous flushing.
Synchronous flushing happens right away; Linux won't wait for
user-requested I/O to complete. Linux schedules asynchronous flushing to
occur as soon as outstanding user I/O requests complete, so that it will
(theoretically) not bother regular users. nfract governs asynchronous
flushing, while nfract_sync (below) governs synchronous flushing.

Defaults to 30% [2.4.0, 2.4.16], 40% [2.2.0, 2.2.20, 2.4.18, 2.4.19],
or 50% [2.4.20.SuSE].
Historic defaults: 60% in 2.0.1, 25% in 1.2.0.
Allowed range: May vary between 0% and 100%.

ndirty  [Unused in 2.4.18, 2.4.19]
------

Ndirty gives the maximum number of dirty buffers that the kernel bdflush
daemon can write to the disk at one time. A high value will mean
delayed, bursty I/O, while a small value can lead to memory shortage
when bdflush isn't woken up often enough.

Defaults to 500 buffers [2.2.0, 2.2.20, 2.4.16], 64 buffers [2.4.0].
Allowed range: in 2.4.16 and 2.4.20.SuSE, may vary between 1 buffer and
50,000 buffers.
Historic range: in 2.2.0, the range was 10 buffers -> 5000 buffers.

nrefill  [Unused in 2.4.16, 2.4.18, 2.4.19, 2.4.20.SuSE]
-------

This is the number of buffers that bdflush will add to the list of free
buffers when refill_freelist() is called. It is necessary to allocate
free buffers beforehand, since the buffers are often different sizes
than the memory pages and some bookkeeping needs to be done beforehand.
The higher the number, the more memory will be wasted and the less often
refill_freelist() will need to run.

Defaults to 64 buffers [2.2.10, 2.4.0], 256 buffers [2.2.0]. In
2.4.20.SuSE, where it's unused, the default is 0.

nref_dirt  [Unused in 2.4.0, 2.4.16, 2.4.19]
---------

When refill_freelist() comes across more than nref_dirt dirty buffers,
it will wake up bdflush.

Defaults to 256 buffers [2.2.10, 2.2.20]. In 2.4.20.SuSE, where it's
unused, the default is 0.

interval  [Unused in 2.2.0, 2.2.10, and proc.txt docs]
--------

This is how often the kernel update daemon [kupdated] is scheduled to
run. The time interval is expressed in jiffies (clockticks). The number
of jiffies per second (the kernel constant HZ) is 100 on most platforms,
and 1024 on the Alpha. Thus, if you read x*HZ in the kernel sources,
that means x seconds.

Setting interval to 0 jiffies is magic; this means that the next time
kupdated wakes up, it should immediately suspend itself and not wake up
again until (unless) somebody sends the kupdated process a SIGCONT.
(You can accomplish the same goal by sending the kupdated process a
SIGSTOP.)

Defaults [in 2.2.20, 2.4.16, 2.4.18, 2.4.20.SuSE] to once every 5
seconds.
Allowed range: 0 seconds (= 0 jiffies, see above) to 10,000 seconds.

age_buffer
----------

How old a dirty data buffer can be before kupdated will write it out to
disk. The value is expressed in jiffies (clockticks). So, no dirty data
buffer may be older than age_buffer jiffies.

Defaults [in 2.2.0, 2.2.10, 2.2.20, 2.4.16, 2.4.18, 2.4.20.SuSE] to 30
seconds. Smallest possible value is 1 second. Largest is 10,000 seconds
in kernels 2.4.16 and 2.4.20.SuSE.

age_super  [Unused in vm.txt docs, 2.4.0, 2.4.16, 2.4.19]
---------

How old can a dirty meta-data buffer be before the kernel writes it out
to disk? The value is expressed in jiffies (clockticks). This is the
meta-data counterpart of age_buffer.
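
To illustrate the jiffies arithmetic above, and why position 7 needs
care, here is a minimal Python sketch. It is not part of any kernel or
existing tool. The names PARAM_NAMES, JIFFY_PARAMS, parse_bdflush, and
describe are made up for this example; HZ = 100 is an assumption that
holds only on most platforms; the name lists are copied from the Table
for just two of the sampled kernels; and SAMPLE is an invented line, not
output captured from a real machine.

```python
# Sketch only. HZ = 100 is an assumption; the text notes the Alpha uses 1024.
HZ = 100

# Names of the nine positions for two of the sampled kernels, copied from
# the Table in this document. Other kernel versions are deliberately omitted.
PARAM_NAMES = {
    "2.2.20": ["nfract", "ndirty", "nrefill", "nref_dirt", "interval",
               "age_buffer", "age_super", "dummy", "dummy"],
    "2.4.16": ["nfract", "ndirty", "dummy", "dummy", "interval",
               "age_buffer", "nfract_sync", "nfract_stop_bdflush", "dummy"],
}

# Parameters whose values are expressed in jiffies rather than counts or %.
JIFFY_PARAMS = {"interval", "age_buffer", "age_super"}


def parse_bdflush(text):
    """Split bdflush-style file contents into a list of nine integers."""
    values = [int(field) for field in text.split()]
    if len(values) != 9:
        raise ValueError("expected 9 integers, got %d" % len(values))
    return values


def describe(text, kernel, hz=HZ):
    """Return (position, name, value, seconds_or_None) for each value."""
    names = PARAM_NAMES[kernel]
    rows = []
    for position, (name, value) in enumerate(zip(names, parse_bdflush(text)), 1):
        seconds = value / hz if name in JIFFY_PARAMS else None
        rows.append((position, name, value, seconds))
    return rows


# Invented sample contents (hypothetical, not read from a real system).
SAMPLE = "30 500 0 0 500 3000 60 20 0"

if __name__ == "__main__":
    for kernel in ("2.2.20", "2.4.16"):
        print(kernel)
        for position, name, value, seconds in describe(SAMPLE, kernel):
            note = "" if seconds is None else "  (%g s at HZ=%d)" % (seconds, hz_used := HZ)
            print("  %d %-20s %d%s" % (position, name, value, note))
```

Read as 2.2.20, the seventh value in SAMPLE is age_super (60 jiffies);
read as 2.4.16, the same number is nfract_sync (60%). A tool that writes
this file has to pick the right interpretation for the running kernel.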

nfract_sync  [2.4.0, 2.4.16, 2.4.18, 2.4.19, 2.4.20.SuSE]
-----------

What percentage of the buffer cache being dirty will the kernel tolerate
before balance_dirty() and try_to_flush_dirty_buffers() start flushing
blocks synchronously? This can be viewed as the hard limit before
bdflush forces buffers to disk.

When the kernel flushes blocks synchronously, it will go ahead and flush
even if this will slow the user's work down.

This does not affect kupdated and its regularly scheduled flushes of old
buffers.

Setting the value to 0% means that the kernel will synchronously flush
every time a block becomes dirty. This should bring your system to a
crawl, especially if you're running multi-user with various background
daemons.

Setting nfract_sync to 100% appears to mean that Linux will not flush
the buffer cache even if it becomes completely dirty. You might never
get to 100% if kupdated always flushes old dirty buffers before the user
programs can fill up the buffer cache with new dirty buffers. I believe
(but haven't read through all the relevant source code) that a
completely dirty buffer cache will cause the kernel to panic as soon as
it needs a free buffer.

Defaults [in 2.4.0, 2.4.16, 2.4.18, 2.4.20.SuSE] to 60%. In keeping with
the Linux philosophy, the minimum is 0% and the maximum is 100%.

nfract_stop_bdflush  [2.4.16 and 2.4.20.SuSE]
-------------------

Percentage of buffer cache dirty to stop bdflush.

Defaults [in 2.4.20.SuSE] to 20%. Minimum is 0% and maximum is 100%.

End-notes:

(1) I don't know what Unix versions 1 through 7 would have done if they
had ever actually filled up all of the kernel's buffer space with dirty
data. The two likelihoods are that they either went into emergency mode
and started flushing buffers as-needed, or that the kernel panicked.
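
The allowed ranges quoted in the parameter sections above can be
gathered into a small checker, which a utility might run before writing
new values. The following Python sketch covers only the ranges this
document states for 2.4.16 and 2.4.20.SuSE. The names LIMITS_2_4_16 and
out_of_range are hypothetical, HZ = 100 is an assumption, and the sketch
says nothing about what the kernel itself does with an out-of-range
value, since this document does not cover that.

```python
# Sketch only. HZ = 100 is an assumption (1024 on the Alpha, per the text).
HZ = 100

# (minimum, maximum) for each parameter, as stated in this document for
# kernels 2.4.16 and 2.4.20.SuSE. Time limits are converted to jiffies.
LIMITS_2_4_16 = {
    "nfract": (0, 100),                  # percent
    "ndirty": (1, 50000),                # buffers
    "interval": (0, 10000 * HZ),         # 0 to 10,000 seconds
    "age_buffer": (1 * HZ, 10000 * HZ),  # 1 to 10,000 seconds
    "nfract_sync": (0, 100),             # percent
    "nfract_stop_bdflush": (0, 100),     # percent
}


def out_of_range(settings, limits=LIMITS_2_4_16):
    """Return one message per setting that falls outside its stated range.

    Settings with no stated range are skipped rather than rejected.
    """
    problems = []
    for name, value in sorted(settings.items()):
        if name not in limits:
            continue
        low, high = limits[name]
        if not low <= value <= high:
            problems.append("%s=%d is outside %d..%d" % (name, value, low, high))
    return problems


if __name__ == "__main__":
    # Hypothetical settings: ndirty and age_buffer are deliberately too small.
    for message in out_of_range({"nfract": 30, "ndirty": 0, "age_buffer": 50}):
        print(message)
```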