$Header: /home/swa/swa-CVS/public_html/Linux/Docs/vm-bdflush.txt,v 1.2 2003/12/19 18:52:30 swa Exp $
------------------------------------------------------------------------------
The /proc/sys/vm/bdflush sysctl interface.
------------------------------------------------------------------------------
Written by Steven Augart May 19, 2002
Revised May 7 -- 9, 2003
Updated December 19, 2003, to include kernel 2.4.20.SuSE
CONTENTS:
Introduction
Credits
Why I Wrote This
Why we care about past kernel versions
History
Inaccurate (but widely circulated) documentation
What
- Table of /proc/sys/vm/bdflush parameters
- nfract
- ndirty
- nrefill
- nref_dirt
- interval
- age_buffer
- age_super
- nfract_sync
End-notes
INTRODUCTION: This file documents the sysctl file
/proc/sys/vm/bdflush, which controls the operation of the bdflush and
kupdated kernel daemons.
CREDITS: This incorporates text from Rik van Riel's excellent
"Documentation for /proc/sys/vm/*", which appears (in at least two
versions) in the 2.2.x and 2.4.x Linux kernel series as
Documentation/sysctl/vm.txt.
WHY I WROTE THIS:
I wrote this document because I was trying to maximize my laptop's
battery life. (I frequently fly across the USA, from my home in
Venice Beach, California, to visit my beloved grandmother near Boston.
This flight takes over five hours eastbound and six hours westbound.)
One way to extend battery life is to keep the disk drive from spinning
up. (See
http://www.augart.com/thinkpad600/hard-drive-spindown-timeout.html for
some calculations on this topic.)
WHY WE CARE ABOUT PAST KERNEL VERSIONS:
I started to write a laptop disk usage control panel, and found that
the existing documentation on the bdflush sysctl file was out-of-date.
I then learned that the kernel's interpretation of the sysctl file's
contents has changed with different kernel versions. This made me
realize that the laptop control panel was going to have to know about
different behavior for different kernel versions. Otherwise, it might
do the wrong thing as soon as I changed kernel versions or distributed
it to someone who used it on another kernel version.
It is not adequate to provide information on what one kernel version
does. Developers want to write utilities that other folks can use.
Users want to have some chance of their programs working when they
upgrade kernels. Linux distributors don't want to have to recompile
other packages when they update the kernel.
In writing this document, I wanted to make sure to do the Right Thing
(tm). This means having documentation that is useful to someone
writing programs that modify the bdflush parameters. As I discuss
above, giving some historical details of each bdflush parameter looks
like the best way of avoiding trouble.
HISTORY:
I examined linux kernels 0.01, 0.12, 0.95, 0.99.15, 1.0, 1.2.0, 2.0.1,
2.0.30, 2.2.0, 2.2.10, 2.2.20, 2.4.0, 2.4.5, 2.4.10, 2.4.13, 2.4.16,
2.4.18, 2.4.19-pre8 (hereafter referred to as 2.4.19), 2.4.20.SuSE,
2.5.5, 2.5.11, and 2.5.15.
linux 0.01 didn't have sysctl files :)
0.12 didn't have a bdflush daemon :)
0.95 ditto, 0.99.14 has the first proto-parameters.
Once upon a time (in the 1970s) there was just a system called Unix.
And by version 1, it had a system call named `sync'. `sync' would
flush out all the dirty buffers to disk.
Until `sync' was called, Unix would just cheerfully keep sticking
pending disk writes into dirty buffers until it had no more
space. (1)
So a daemon (/sbin/update) was added to automatically call `sync'
every thirty seconds. That meant that you couldn't lose more than a
bit over thirty seconds worth of data. This seemed reasonable.
Linus used this approach for the Linux kernel. Right from kernel
version 0.01, there was a sync() system call. The idea was that
you'd have an update daemon (/sbin/update) to do the writes.
Eric Youngdale implemented some code to improve matters.
[ TODO: Add more history here ]
In December 1995, Paul Gortmaker removed the necessity for a syscall
to start bdflush(); he made it so that we can use a kernel thread
instead.
In 1999, Andrea Arcangeli added asynchronous buffer flushing.
In the 2.4 kernel series, the bdflush code is primarily in
fs/buffer.c. By kernel version 2.4.20.SuSE, the bdflush parameters are
defined in a separate file, include/linux/bdf_prm.h.
INACCURATE (but widely circulated) DOCUMENTATION:
Documentation claiming to describe how the bdflush parameters worked in
2.2.10 is in Documentation/sysctl/vm.txt in the 2.4.x and 2.5.x kernel
source trees. It is mostly accurate, but some aspects of its
description of bdflush reflect features of the 2.4.x kernels, and the
version number in the source wasn't updated. The other widely
available document for it, Documentation/filesystems/proc.txt, merged
in Rik van Riel's version of "vm.txt" for kernel version 2.1.128.
"proc.txt", as shipped in the most recent kernels (2.4.19, 2.5.15),
states that it was updated to 2.4.0. This update appears to have left
Rik van Riel's 2.1.128 material untouched.
WHAT:
/proc/sys/vm/bdflush contains nine integer values. Different kernel
versions use different ones of these values. Even more exciting is
that one of them (parameter 7, nfract_sync/age_super) changed meaning
in each of the 2.2, 2.4, and 2.5 kernel series.
They are listed in the Table in order, along with the kernel versions
I sampled in the 2.0.x, 2.2.x, and 2.4.x series in which they are or
are not used.
None of them are used in 2.5.15; the 2.5.x kernel series is
going over to a new subsystem, called `pdflush'.
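As a minimal sketch of how a utility might read and write these nine
values, the following Python fragment assumes that the file's contents
are nine whitespace-separated decimal integers. The function names and
the sample contents are hypothetical, chosen only for illustration; they
are not taken from any kernel source.

```python
def parse_bdflush(text):
    """Split the contents of the sysctl file into a list of nine integers."""
    values = [int(field) for field in text.split()]
    if len(values) != 9:
        raise ValueError("expected 9 integers, got %d" % len(values))
    return values


def format_bdflush(values):
    """Render nine integers as one whitespace-separated line."""
    if len(values) != 9:
        raise ValueError("expected 9 integers, got %d" % len(values))
    return " ".join(str(int(v)) for v in values)


# Hypothetical sample contents, not the defaults of any particular kernel.
sample = "30 500 0 0 500 3000 60 20 0\n"
params = parse_bdflush(sample)
```

On a real system the text would come from reading /proc/sys/vm/bdflush,
and the formatted line would be written back to the same file as root.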
Table: Parameters in /proc/sys/vm/bdflush
..............................................................................
#  Value                Meaning
1  nfract               Percentage of buffer cache dirty to activate bdflush
                        [All versions]
2  ndirty               Maximum number of dirty blocks to write out per
                        wake-cycle
                        [In 1.2.0, 2.0.1, 2.2.0, 2.2.10, 2.2.20, 2.4.0,
                        2.4.16, 2.4.20.SuSE]
2  dummy                [Unused in 2.4.18, 2.4.19]
3  nrefill              Number of clean buffers to try to obtain each time
                        we call refill
                        [In 1.2.0, 2.0.1, 2.2.0, 2.2.10, 2.2.20, 2.4.0]
3  dummy                [2.4.16, 2.4.19, 2.4.20, and the 2.4.16 vm.txt]
4  nref_dirt            Dirty buffer threshold for activating bdflush when
                        trying to refill buffers
                        [In 2.0.1, 2.2.0, 2.2.10, 2.2.20]
4  dummy                [vm.txt, 2.4.0, 2.4.16, 2.4.19, 2.4.20]
5  interval             Jiffies delay between kupdated flushes
                        [In 2.2.20, 2.4.0, 2.4.16, 2.4.18, 2.4.19, 2.4.20]
5  dummy                [Unused in 2.2.0, 2.2.10]
5  clu_nfract           Percentage of the buffer cache to scan to search
                        for free clusters
                        [In 1.2.0, 2.0.1]
6  age_buffer           Time for normal buffer to age before we flush it
7  nfract_sync          Percentage of buffer cache dirty to activate bdflush
                        synchronously
                        [In vm.txt, 2.4.0, 2.4.16, 2.4.18, 2.4.19, 2.4.20]
7  age_super            Time for superblock to age before we flush it
                        [In 1.2.0, 2.0.1, 2.2.0, 2.2.10, 2.2.20, and proc.txt]
8  dummy                [Unused in 2.2.0, 2.2.10, 2.2.20, 2.4.0, 2.4.18,
                        and 2.4.19]
8  nfract_stop_bdflush  Percentage of buffer cache dirty to stop bdflush
                        [In 2.4.16 and 2.4.20.SuSE]
8  lav_const            Constant used for load average
                        [In 1.2.0, 2.0.1]
9  dummy                Unused
9  lav_ratio            Used to determine how low a lav for a particular
                        size can go before we start to trim back the buffers
                        [In 1.2.0, 2.0.1]
..............................................................................
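A program that has to cope with several kernel versions could carry the
table as data. The sketch below transcribes the slot names for just two
of the sampled kernels, 2.2.20 and 2.4.16, as the table lists them. The
dictionary and function names are hypothetical, and the values passed in
are made up for illustration.

```python
# Slot names for positions 1..9, transcribed from the table above.
SLOT_NAMES = {
    "2.2.20": ["nfract", "ndirty", "nrefill", "nref_dirt", "interval",
               "age_buffer", "age_super", "dummy", "dummy"],
    "2.4.16": ["nfract", "ndirty", "dummy", "dummy", "interval",
               "age_buffer", "nfract_sync", "nfract_stop_bdflush", "dummy"],
}


def live_parameters(version, values):
    """Map the nine values to names, dropping the unused (dummy) slots."""
    names = SLOT_NAMES[version]  # KeyError for a kernel we have no row for
    if len(values) != len(names):
        raise ValueError("expected %d values" % len(names))
    return {name: value
            for name, value in zip(names, values)
            if name != "dummy"}


# Hypothetical values, not the defaults of any particular kernel.
example = [30, 500, 0, 0, 500, 3000, 60, 20, 0]
as_2_4_16 = live_parameters("2.4.16", example)
as_2_2_20 = live_parameters("2.2.20", example)
```

The same nine numbers yield `nfract_sync` under 2.4.16 and `age_super`
under 2.2.20 for slot 7, which is exactly the trap a version-unaware
tool would fall into.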
nfract
------
This parameter is a percentage. It sets the maximum share of the buffer
cache that may be dirty before the kernel will flush some of the buffers.
Dirty means that the contents of the buffer still have to be written to disk
(as opposed to a clean buffer, which can just be forgotten about). Setting
this to a higher value means that Linux can delay disk writes for a long time,
but it also means that it will have to do a lot of I/O at once when memory
becomes short. A lower value will spread out disk I/O more evenly, at the cost
of more frequent I/O operations.
In the 2.2.x kernel series, this I/O is synchronous. In the 2.4.x series,
this is queued asynchronous I/O (more on that below) and there is another
(presumably larger) parameter, nfract_sync, that governs the percentage
threshold for synchronous I/O.
2.4.x kernels distinguish between synchronous and asynchronous flushing.
Synchronous flushing happens right away; Linux won't wait for user-requested
I/O to complete. Linux schedules asynchronous flushing to occur as soon as
outstanding user I/O requests complete, so that it will (theoretically) not
bother regular users. nfract governs asynchronous flushing, while nfract_sync
(below) governs synchronous flushing.
Defaults to 30% [2.4.0, 2.4.16], 40% [2.2.0, 2.2.20, 2.4.18, 2.4.19],
or 50% [ 2.4.20.SuSE ]. Historic defaults: 60% in 2.0.1, 25% in 1.2.0.
Allowed range: May vary between 0% and 100%.
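The two 2.4.x thresholds can be pictured with a small model. This is a
simplified sketch of the behavior described above, not the kernel's
code: the function name is hypothetical, and the choice of a strict
"greater than" comparison is an assumption, made because it agrees with
the descriptions of the 0% and 100% settings of nfract_sync.

```python
def flush_mode(dirty, total, nfract, nfract_sync):
    """Classify buffer cache state as 'none', 'async', or 'sync'.

    dirty and total are buffer counts; nfract and nfract_sync are
    percentages. Integer arithmetic avoids floating-point rounding.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if dirty * 100 > total * nfract_sync:
        return "sync"    # past the hard limit: flush right away
    if dirty * 100 > total * nfract:
        return "async"   # past the soft limit: queue a flush
    return "none"
```

With nfract at 30 and nfract_sync at 60, a cache that is 40% dirty falls
in the asynchronous band, and one that is 70% dirty in the synchronous
band.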
ndirty [Unused in 2.4.18, 2.4.19]
------
Ndirty gives the maximum number of dirty buffers that the kernel bdflush
daemon can write to the disk at one time. A high value will mean delayed,
bursty I/O, while a small value can lead to memory shortage when bdflush isn't
woken up often enough.
Defaults to 500 buffers [2.2.0, 2.2.20, 2.4.16], 64 buffers [2.4.0].
Allowed range: in 2.4.16 and 2.4.20.SuSE may vary between 1 buffer
and 50,000 buffers.
Historic Range: in 2.2.0, range was 10 buffers -> 5000 buffers.
nrefill [Unused in 2.4.16, 2.4.18, 2.4.19, 2.4.20.SuSE]
-------
This is the number of buffers that bdflush will add to the list of free
buffers when refill_freelist() is called. It is necessary to allocate free
buffers beforehand, since the buffers are often different sizes than the
memory pages and some bookkeeping needs to be done beforehand. The higher the
number, the more memory will be wasted and the less often refill_freelist()
will need to run.
Defaults to 64 buffers [2.2.10, 2.4.0], 256 buffers [2.2.0].
In 2.4.20.SuSE, where it's unused, the default is 0.
nref_dirt [Unused in 2.4.0, 2.4.16, 2.4.19]
---------
When refill_freelist() comes across more than nref_dirt dirty buffers, it will
wake up bdflush.
Defaults to 256 buffers [2.2.10, 2.2.20]. In 2.4.20.SuSE, where it's
unused, the default is 0.
interval [Unused in 2.2.0, 2.2.10, and proc.txt docs]
--------
This is how often the kernel update daemon [kupdated] is scheduled to run.
The time interval is expressed in jiffies (clockticks). The number of jiffies
per second is 100 on most platforms, and 1024 on the Alpha. Thus, if you
read x*HZ in the kernel sources, that means x seconds.
Setting interval to 0 jiffies is magic; this means that the next time kupdated
wakes up, it should immediately suspend itself and not wake up again until
(unless) somebody sends the kupdated process a SIGCONT. (You can accomplish
the same goal by sending the kupdated process a SIGSTOP.)
Defaults [in 2.2.20, 2.4.16, 2.4.18, 2.4.20.SuSE] to once every 5
seconds.
Allowed range: 0 seconds (= 0 jiffies, see above)
to 10,000 seconds.
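Since the file holds jiffies but the defaults and ranges here are quoted
in seconds, a tool needs to convert between the two. A minimal sketch,
using the HZ values given above (100 on most platforms, 1024 on the
Alpha); the constant and function names are hypothetical.

```python
HZ_DEFAULT = 100   # jiffies per second on most platforms
HZ_ALPHA = 1024    # jiffies per second on the Alpha


def seconds_to_jiffies(seconds, hz=HZ_DEFAULT):
    """Convert seconds to a whole number of jiffies."""
    return int(seconds * hz)


def jiffies_to_seconds(jiffies, hz=HZ_DEFAULT):
    """Convert jiffies to (possibly fractional) seconds."""
    return jiffies / hz


default_interval = seconds_to_jiffies(5)           # the 5-second default
alpha_interval = seconds_to_jiffies(5, HZ_ALPHA)   # same delay on an Alpha
```

A program that hard-codes 500 for "five seconds" would be wrong on the
Alpha, so the HZ value should be a parameter rather than a constant
buried in the arithmetic.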
age_buffer
----------
How old a dirty data buffer can be before kupdated will write it out
to disk. The value is expressed in jiffies (clockticks). So, no
dirty data buffer may be older than age_buffer jiffies.
Defaults [in 2.2.0, 2.2.10, 2.2.20, 2.4.16, 2.4.18, 2.4.20.SuSE] to 30 seconds.
Smallest possible value is 1 second. Largest is 10,000 seconds in
kernels 2.4.16 and 2.4.20.SuSE.
age_super [unused in vm.txt docs, 2.4.0, 2.4.16, 2.4.19]
---------
How old can a dirty meta-data buffer be before the kernel writes it out to
disk? The value is expressed in jiffies (clockticks). This is the meta-data
counterpart of age_buffer.
nfract_sync [2.4.0, 2.4.16, 2.4.18, 2.4.19, 2.4.20.SuSE]
-----------
What percentage of the buffer cache being dirty will the kernel tolerate
before balance_dirty() and try_to_flush_dirty_buffers() start flushing blocks
synchronously? This can be viewed as the hard limit before bdflush forces
buffers to disk. When the kernel flushes blocks synchronously, it will go
ahead and flush even if this will slow the user's work down.
This does not affect kupdated and its regularly scheduled flushes of old
buffers.
Setting the value to 0% means that the kernel will synchronously flush every
time a block becomes dirty. This should bring your system to a crawl,
especially if you're running multi-user with various background daemons.
Setting nfract_sync to 100% appears to mean that Linux will not flush the
buffer cache even if it becomes completely dirty. You might never get to 100%
if kupdated always flushes old dirty buffers before the user programs can fill
up the buffer cache with new dirty buffers. I believe (but haven't read
through all the relevant source code) that a completely dirty buffer cache
will cause the kernel to panic as soon as it needs a free buffer.
Defaults [in 2.4.0, 2.4.16, 2.4.18, 2.4.20.SuSE] to 60%. In keeping
with the Linux philosophy, the minimum is 0% and the maximum is 100%.
nfract_stop_bdflush [2.4.16 and 2.4.20.SuSE]
-------------------
Percentage of buffer cache dirty to stop bdflush.
Defaults [in 2.4.20.SuSE] to 20%. Minimum is 0% and maximum is 100%.
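Before writing new values, a tool might check them against the allowed
ranges quoted in the sections above. The sketch below collects those
ranges as they are stated for 2.4.16, assumes HZ is 100 when turning
seconds into jiffies, and assumes the interval and percentage ranges
(which are quoted without a kernel version) apply to 2.4.16 as well.
The names are hypothetical.

```python
HZ = 100  # assumption: jiffies per second on most platforms

# (minimum, maximum) per parameter, from the ranges quoted above.
RANGES_2_4_16 = {
    "nfract": (0, 100),
    "ndirty": (1, 50000),
    "interval": (0, 10000 * HZ),
    "age_buffer": (1 * HZ, 10000 * HZ),
    "nfract_sync": (0, 100),
    "nfract_stop_bdflush": (0, 100),
}


def out_of_range(params):
    """Return the sorted names of parameters outside their quoted range."""
    problems = []
    for name in sorted(params):
        low, high = RANGES_2_4_16[name]  # KeyError for an unknown name
        if not low <= params[name] <= high:
            problems.append(name)
    return problems
```

An empty list means every supplied value is inside the quoted range; it
says nothing about whether the combination is sensible, such as nfract
being larger than nfract_sync.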
End-notes:
(1) I don't know what Unix versions 1 through 7 would have done if
they had ever actually filled up all of the kernel's buffer space with
dirty data. The two likelihoods are that they either went into
emergency mode and started flushing buffers as-needed, or that the
kernel panicked.