You (like us) have 80 to 90 Sun 3/50 machines with 4 megabytes of memory. You have been given some optical discs containing System V.4. Which can you least afford to discard? Things are getting out of hand. Soon, 32 megabyte Ultrasparks will be needed just for us to start the window system in under a minute.

For UNIX, now in middle age rotundly recalling its sprightly youth, mere `tuning' will not cause that heavy code to slip away. We need to reconsider and re-implement the system interface periodically, to take account of changes in its environment. We must be willing to throw things away, discarding parts of the older implementation completely, rather than corrupting clean new mechanisms to approximate the mistakes of the past.
To illustrate this thesis, I shall discuss work I have done on SunOS 3.5 to reduce its size and complexity. The virtual memory system has been replaced by a simpler one using ideas from the EMAS system and elsewhere. A stream IO system in the 8th/9th Edition style has been added, replacing the old character IO system, pipes, and UNIX domain sockets. I have also made some preliminary forays into the Virtual File System interface.
Here, I shall pick examples from various parts of the system to show how certain designs match the structure of UNIX well, and thus make things easier; and conversely, where there is a serious clash, the implementation becomes much more complicated and the facilities less useful. It can also be less painful in the end to replace some older things completely, rather than pile hack upon hack. I claim it is a mistake to standardise (or to have standardised) certain parts of the system interface in their current form.
/unix: 95280+20400+146252 = 261932b = 0x3ff2c
(All kernel tables are included in the data and bss sizes; IO buffers are not.) Starting with `layers' [Pik83] code by Simon Kenyon, Nigel Roles wrote an implementation of Pike's window system mpx [Pik84] that was good enough to run programs such as the text editor jim for the Atari ST. We had a promising environment.
In 1987, however, we decided for financial reasons to switch to Suns (discless 3/50s supported by file servers). I was disconcerted, not to say disheartened, to find that the Sun kernel code was about 390 kbytes, even properly configured (as distributed it was 490k). It had notionally been built on 4.2 BSD, but the changes to support the Virtual File System and `vnodes' [Kle86] were extensive, although the character IO system and socket systems were largely unaffected.
A significant constraint was that the system must run SunOS 3.x binaries: we had software such as Framemaker for which we had no source, and programs would also need to be exportable without change.
Excluding devices, the larger components of the kernel were: the virtual memory system; the window system; the 4.2BSD File System; basic kernel functions including fork, exec, signals, and scheduling; the network file system; sockets and buffering, including pipes; virtual file system support; and the network software. I decided to look at a few of them more closely, taking advantage of the experience gained on the VAX.
Some degree of improvement was ultimately made in all areas except the Fast File System and the network side of the socket code. (Eventually both might be replaced.) The CPU scheduling code was replaced by a variant of a simple yet flexible scheduler developed at the University of Maryland [Str86].
The new system has been in use in various forms for more than a year on several Sun 3/50s and one 3/160 in the Department. Depending on configuration, it is currently about 120kbytes smaller than the old (the Sun window system still takes about 60 kbytes in the current kernel). There is still much scope for improvement, but even now the system does more, uses memory more efficiently, and is (I claim) better structured.
In following sections I shall discuss the changes to the paging system, changes made or currently being made to the virtual file system, the installation of the stream IO system, and the implications of all that on kernel implementation and applications.
The daemon checks the page table entry for the page to see if the `page used' bit is set. If not, the page can be made available. If it has been modified, it must first be written out (asynchronously). If the page has been referenced, the daemon resets the `used' bit and proceeds to the next core map entry. If the page has not been touched the next time round, it will be freed. On systems with large memories, a complete cycle can take some time, and so recent versions have the daemon cycling two pointers through the map: one clears `used' bits; the other reclaims pages.
In principle, this is a simple and elegant technique. It has been applied quite effectively in a controlled environment in a new implementation of file cacheing in UNIX using virtual memory [Bra89]. There are, however, several problems in practice with the algorithm when applied to UNIX processes, at least in its SunOS 3.5/BSD 4.2 version: the core map must keep a back reference to a page table entry, a process's working set size is not always accurate, and the memory scheduling algorithms are elaborations of the early `swapping' algorithms.
The daemon must be able to locate `the page table entry' for each page. The core map entry for a physical page must therefore point back to the page table entry that refers to it, if there is one. Map entries for data and stack pages store a (process, page) pair. Pages of an executable program, though, can be referenced by several such entries, which must all be found. The solution adopted is to store (text-table, page) in the core map. Processes using a given executable are linked in a chain from its text table entry. Whenever one of the processes' page table entries changes, owing to page in or page out, all others must change to match it. The pageout daemon looks round the list too, in order to calculate the union of used/modify bits for all processes referring to the page.
When the page daemon takes a page, it invalidates the page table entry but leaves a reference to the page (the page frame number), so that should the process soon fault on that page it can reclaim the page directly from the free list. Before any process is given a free page, the memory allocator must therefore clear any reference to it, using the data in the core map in much the same way as the pageout daemon. Just to liven things up, in 4.2BSD the network system can claim pages at interrupt level using the memory allocator, so that a process's page table entries can be changed by an interrupt! (In practice this is not a problem provided care is taken at the kernel main level, but it is something else to have to reason about when checking the code.)
To reduce overhead, the old pageout daemon does not run until the amount of free memory drops below a specified value. It then cycles round the core map at a rate that depends on the amount of memory left. If it cannot free enough memory quickly enough, the memory scheduler (swapper) is started. The swapper synchronously writes to disc clumps of the modified pages from the whole image of each process it selects.
Pointers, then, should go from the process to memory (the shared resource). Pages can be shared easily if there is a reference count for each page. In the new system, a process releases its own pages when it no longer requires them. Periodically, each process examines its page table entries and discards pages that have not been used since the last such scan. It resets the `used' bit on the remaining pages. Over time, the process tracks its own working set.
When a process discards a page during a working set scan or a complete page out, the page is put in either a free page or modified page list depending on whether it has been modified since being read in. Pages on the modified list are written to disc by a separate system process (the new `pageout daemon'). It does not write pages out immediately, but only when the modified page list grows too long, or when the free list length drops below a specified level. This increases the chance that the page can be reclaimed by its process without IO overhead. Pages are returned to the free list after they have been written out. The new pageout daemon is a simple loop: it waits for work, collects pages from the modified list, and writes them out.
By itself, use of a working set scan in each running process is not enough to prevent thrashing, since the combined working sets of all such processes might not fit in memory. The collection of processes in memory must be subject to some global constraint. The most obvious constraint to apply is that before a process is loaded there must be enough store to hold its working set. This typically changes with time, however, and so a scheduler is required that can adapt to these changes.
The EMAS system has an elegant scheduling system [She74] that was chosen for use in our system. Each process in EMAS is assigned a category, a single number that captures its current resource requirements, and also determines its CPU and memory priority, working set scan frequency, and CPU quantum size. When loaded by the scheduler, a process is given a page and CPU time allocation determined by its category. It remains loaded and in the CPU despatcher's queues until it exhausts either allocation or goes to sleep. The process must then submit itself for rescheduling.
The scheduler refers to a resource model for the system, encoded in a category table which is indexed by process categories. The table can be viewed as a little state machine with transitions from category to category. There are four transitions from each category, labelled `more memory', `more time', `unused memory', and `unused time'. The scheduler follows the appropriate path to obtain a new category that suits the process. In this way the scheduler quickly tracks the character of a process as it changes.
The rescheduled process may remain loaded if it is runnable, its new allocation fits in the available memory, and no higher priority process is waiting. It must otherwise page itself out, and join a priority queue to wait for memory. A process sleeping on an external event must likewise page itself out whilst waiting unless the memory it occupies is not in demand. When it wakes, it joins the appropriate memory queue. A memory scheduler process cycles through the priority queues using an auxiliary priority ratio table, loading each process in turn as memory becomes available. (See the EMAS papers for details. Wilkes [Wil75] includes the EMAS scheme in an interesting discussion of paging viewed as a problem in control theory.)
The notion of an `external event' is hard to sustain in modern UNIX systems. Instead, the priority of a sleep (this refers to the kernel sleep/wakeup synchronisation system, not to the C library function) is assumed to say something about the likely speed of response. When a process sleeps at high priority, it records the unused part of its quantum, adjusts its CPU priority (see below), and enters the sleep queues as usual. A process sleeping at low priority first checks with the memory scheduler before entering the sleep queues. If its memory is required now, it will page itself out, as in EMAS, but otherwise it puts itself in a priority queue of processes that are `idle' in memory.
The new memory scheduler has a simple loop: it checks the `wait memory' queues in the order given by the priority ratio table, and once it has found a process, it waits for memory to become available, as lower priority processes page themselves out. It also watches for the arrival of idle processes, and when memory is needed causes them to page themselves out, starting with low priority processes, as soon as they have been idle `long enough' (a scheduling parameter).
The original scheduler had the hardware clock interrupt increment a `time' field in each process descriptor to decide how long a process had been in or out of memory, or sleeping. Berkeley moved this to a function called periodically by a software clock, but it still required a scan of all living processes. Instead, in this system, a process is simply time stamped when it enters any (FIFO) scheduling queue. (Currently the stamp is the time in ticks since system startup.)
Fork. The fork call makes a copy of the caller's memory image. As many have observed, this is often wasteful, since in most cases the fork precedes an exec which throws the copy away! Paging everything in to make the copy will also disrupt tracking of the working set. Fortunately, copying is just the specification; we are free to implement it in the most efficient way. Many have suggested (and several have implemented) sharing memory between parent and child until one of them writes to a page; the write is intercepted, the page is copied, and the write is completed in a private copy of the page. On some machines copy-on-write is hard or costly to implement, but copy-on-reference can be used instead and should be nearly as effective in the most common case (speeding up fork before exec).
Berkeley nevertheless implemented fork by complete copy, and added a new system call vfork instead. The parent passes its page tables to the child. The parent is suspended, marked `no VM', and linked to the child, which is marked `in vfork'. The child process returns from fork and continues to run in the parent's memory until exec or exit, at which time it wakes its parent, and waits. The parent grabs its memory back, resets some pointers, and allows the child to proceed. Several parts of the system must check for `in vfork' (including the terminal handler!). The system calls exec and exit must synchronise with (the parent's) fork, and the pageout daemon must follow the vfork links to find the current user of the page if its true owner is marked `no VM'. Thus fork is moderately expensive, and vfork is rather odd.
The new system does use copy on write. The fork call copies the swap maps, incrementing the swap reference counts; copies the page tables setting `copy on write' state and read-only access; and increments a core map reference count for valid pages (a page will not be put in the free list until the reference count goes to zero). There can then be two kinds of page access faults. On a `write protect' error, a page that is no longer shared can have its permissions restored; a shared page must be copied. In either case the page becomes private to the process. (A private page also results from a `page missing' fault if the page is `fill on demand'.) Including the code to implement reference counts on swap pages, no more code is required to implement copy on write in this system than vfork took in the old.
Page theft. In the original system, the pageout daemon wanders round the core map stealing pages. Each time, it must check the state of the process owning the page. If it is in a safe state, it temporarily maps in the process's kernel data area to access the swap maps. If the page being stolen had been modified, the daemon mapped it into its own VM and started asynchronous IO.
In the new system, nothing happens `asynchronously' to a process as regards the paging system, since it either discards its own pages or must queue itself to allow the memory scheduler to do so. It simply does neither if it is in an inconvenient state, or if scheduling policy does not yet require it. The new working set scan routines pagediscard and refscan come to 51 lines including some optional event tracing calls (except when a page must be freed, no functions are called during a scan), but many more lines of code have been removed.
Reclaiming pages. In either system, a process might lose or discard a page, then require it again before it has been taken from the free list for use elsewhere. As mentioned above, the old system leaves the page table entry non-zero to allow reclaim, but clears it as soon as the page is taken for use elsewhere. It also has a hash table, in which pages can be found using the pair (source-vnode, block) as a key.
In his system, Richard Miller [Mil84] had an array of disc block addresses, one per memory page, each entry containing either zero or the current swap block address of the page. When a page table entry was invalidated, the page frame number was left non-zero. Before reclaiming a page, the system checked that the swap block address matched that expected by the current process. No action was necessary if the page was claimed by another process. A similar scheme is used here (the swap address is kept in the core map, for use by the pageout daemon). There is currently no need for a hash table. Each shared image (text) entry provides a `master page table' that tracks the location of executable pages, which might page out from one frame and page in elsewhere some time later (but which should still be shared).
Swapping. A swap operation in the old system writes out all the modified pages in a process. There are special support routines that find contiguous regions of dirty pages that are then split to fit in contiguous regions of swap space. There are some others that, during page out, find modified memory pages that are adjacent in virtual memory to the one selected by the page out daemon.
In the new system, a swap is just a complete process page out, which simply frees the process's memory; clean pages go on the free list, and dirty ones go on the modified page list. If dirty page table entries are adjacent, and they are freed in sequence, and at the same time, their associated pages will be adjacent on the modified page list. The core map entries for the pages have the swap addresses attached to assist page reclaim. The pageout daemon simply checks adjacent modified page list entries to see if the swap blocks are adjacent and if so writes them out with one IO request. This gives the effect of a `partial swap' at little cost. The same effect is obtained by a process that releases adjacent dirty pages during a working set scan. The cost of a bad scheduling decision (eg, a sleeping process is swapped and immediately wakes up) is reduced, since a process can reclaim pages from the modified page list before they are written. (The per-process data and page tables are paged out and reclaimed in the same way.)
User control. The use of working sets should help to address some of the problems mentioned by Rosenthal [Ros89], since the scheduler can more accurately track a process's memory requirements. Even an overenthusiastic process can be kept within bounds, paging mainly against itself, so that on a single-user machine, for instance, a text editor running in a windowing system at the same time as such a process has a better chance of remaining responsive than under the old system. Scheduling policy can be controlled in several ways, most simply by changing the category table, although Stephens et al. [Ste80] do warn that the effects of doing so are not always easy to predict, and might require experiment. The scheduling parameters and tables are accessible through special files in the directory /dev/sched, making it easy to give workstation users control over the allocation of resources to processes on their machines. Windowing processes can be given a set of categories disjoint from the others, just as EMAS provided different sets of categories for interactive and batch processes.
	void
	readi(struct inode *ip)
	{
		(*filesw[ip->i_fs->m_type].f_readi)(ip);
	}

The function bodies are renamed and moved into files associated with the disc file system type; suitable entries are made in the switch table. More work is required to add indirection to file name lookup and when managing the `inode table' (iget, iput), and one might consider changing the calling conventions to hide from the file systems the structure of the per-process data area, but this should give the basic idea.
Within any file system, there have always been different kinds of files: directories, devices, sockets, FIFOs, symbolic links, and the `regular' file. In 7th Edition, each file is tagged with a value telling its kind (IFDIR, IFREG, etc). The internal versions of read and write can be applied to any kind of file. Reading a regular file yields its data, but reading a symbolic link produces a file name, and reading a directory gives a list of (inode-number, name) pairs in some format.
The internal version of write can also write an (inode-number, name) pair into a directory; it thus creates a link.
The same scheme can be used with the file system switch. The implementation of `read' for the file system will look at the kind of file to decide what to do. Note that all file systems must agree on a standard format in which they will exchange directory entries with the kernel during read and write. This need not be related to the directory format that the file system uses internally. For example, /proc makes up directory entries `on the fly' out of the contents of the kernel's process table. It is not necessary to add a special `readdir' or `getdirents' call.
Implementation in this way puts the file system switch below the bulk of the UNIX system call/file system interface code. The job of implementing UNIX semantics (pathnames, checking permissions, managing times, inode data structure locking, etc.) is mostly done above that. The file system writer can concentrate on the rewarding job of obtaining the data to make curious things look like files and directories to processes.
Sun's system is avowedly `object oriented': the group of function pointers linking to the operations is attached directly to the data structure or `object' representing the file system (VFS) or the file (vnode). One might then expect each vnode to accept just the operations it implements: the operation structure attached to a `directory' in some file system would accept `lookup', `rename', and other such directory requests, but its entry for `readlink' would point to an error routine. Similarly, a `symbolic link' would reject many other operations but accept `readlink'. That is not what happens. In all the file systems in SunOS 3.5, every vnode in a given file system has the same operation set, which accepts any operation that is allowed by some kind of file in the file system. Each operation then checks a `file kind' tag to see what sort of file it is, just like the other systems. (Indeed, it is sometimes checked twice: at the system call level, and by the vnode operation. The second check is not redundant, since the NFS server takes a shortcut, bypassing the first check by calling vnode operations directly.) As things stand, the operations are not really associated with the vnode at all, but with the collection of vnodes inside a particular file system, and the contents of the operation set are known statically.
Putting the pointer in the vnode itself is just an optimisation.
On the VAX system, the only file system types were UNIX ones, and so UNIX semantics had been implemented just once above the file systems by the system call level. I decided to provide a file system `base implementation' that would support UNIX file system implementations. The kernel continues to operate on vnodes using Sun's interface and operations, although the VFS level was changed to provide vnode locking primitives, as a service to be used as required by the lower levels. Most of the real work is done by a new file system type.
The unix file type attends to the standard UNIX semantics for files, including maintenance of user and group IDs, protections, file mode, file times, size, provision of a basic set of values for stat (`getattr'), and a basic protection policy. It simplifies implementation of the proc file system, a memory file system (which allows a directory structure and files to be created in main memory), and some others to be described presently. It currently does not handle the ufs file system type (the 4.2 disc file system).
It also supports several types of anonymous vnodes, file types that lack names in the file system hierarchy: currently pipe, fifo, socket, and config are included. The pipe and fifo types provide vnodes for all stream IO (not just IO on pipes), the fifo subtype differing mainly in the open/close protocol. The socket and config types need more explanation.
Having all files in the system accessible through the file system interface (inode or vnode) makes it more likely that applications can use any kind of file in most contexts.
A character device is represented by open, close, read, write, ioctl and select operations selected by indirection through a `device switch'. Since vnodes provide a superset of these operations and the operation set can be unique to the vnode, perhaps the device switch is redundant.
The (experimental) config file system can potentially replace the device switches. It provides a device name space (hierarchy) derived automatically from the machine's configuration.
Each file in it has its own set of operations that refer to the appropriate device handler routines. Multiplexors are represented by directories that generate channel vnodes on demand. (This might serve to eliminate major/minor device numbers from the system, removing the need to make them larger on big systems.) Config differs from specfs (Sun's special devices file system) in that it provides a mountable name space. Specfs makes a vnode for a device; config's vnodes are the devices (or pseudo-devices). It is an experimental file system because I am uncertain how best to handle the file attributes. Originally in UNIX, all file systems were on disc, saving the permissions, ownership, and file times of device inodes over a reboot. Currently, because config is a virtual hierarchy, without persistent store, attributes on some files must be set by /etc/rc when the default attributes are not acceptable. This might turn out to be the best way.
These help to make streams neutral in their application in the same way that any data can be written in any format to a UNIX file, or sent through a UNIX pipeline. By contrast, sockets provide options for datagrams, streams, sequenced packets, and a `raw' interface: they clearly provide a networking interface.
There are two ways of dealing with networks in UNIX using streams: make the complex network look like a collection of simple streams [Pre85, Pre86], or change the stream system to look more like a network. AT&T took the latter approach when the stream IO system became ``STREAMS technology'' in System V. It acquired some new features: multiplexor drivers; the contents of a stream are now a sequence of messages, which are themselves sequences of bytes; and there are now options at the stream head, including IO modes. IO modes include `message discard', `message non-discard', and `byte stream': they provide different ways of reading the data on the stream.
The result is something of a cross between the original stream system and sockets. The basic STREAMS implementation is consequently somewhat larger than mine: the support for streams, queues, and buffering takes 29,640 bytes in SunOS 4.0, compared to my system's 9,180 bytes. (The STREAMS figure does not include support added recently for independent `out of band' message sequences within one stream [Rag89].)
After each stage the new kernel was indeed smaller, even though some new functionality had been added (for instance, `message' and `connection' modules) [Pre85]. Certainly, one hopes for this result when replacing an accumulation of many special cases and conventions by something with a clean, general model. Much of the saving comes from eliminating device and line discipline `read' and `write' interface routines (the streams code does the work).
Sun added STREAMS to SunOS 4.0. In every case the comparable STREAMS versions have grown slightly larger, not smaller, than the components they replace. (The compatibility module ttcompat is smaller than the old tty driver, but it simply translates ioctl calls for the new tty module ldterm, which is larger.) The increase in the size of the tty modules on conversion to STREAMS is possibly because the new ones implement all the System V conventions. Unfortunately, some of those were artifacts of the old character IO system, and were overdue for replacement.
A good example is the MIN/TIME facility of the old terminal driver [Mic88]. It takes a full A4 page of typeset text to describe the effect of different combinations of MIN and TIME values, and the restrictions! One case provided buffering on a bursty asynchronous line (eg, for uucp): the driver waits until characters stop arriving frequently enough or a minimum number accumulates. Another allowed a process to wait until at least one character arrived or a timer expired. The first case could now be handled by pushing onto the stream a buffering module similar to bufld(4) in the Ninth Edition [Lab86], which can be applied to any stream, not just one with tty editing. (In fact, some programs that currently use MIN/TIME do not want tty editing.) The second looks like an application for poll or select. Instead, new mechanisms were added at the stream head to support the feature.
Streams also allow (indeed encourage) modules to be composed freely. Thus with streams, a Berkeley pseudo-terminal can be replaced by a pipe with message and tty modules installed appropriately. Neither module knows of the other's existence. (We implement simple text windows on our Sun 3/50s in this way; a more useful `editing' shell window needs only the message module.) Of course, the particular application program must be revised accordingly, since the interfaces are radically different. In SunOS, the pseudo-terminals were instead `converted' to streams, but their rather intricate nature required the use of an `intra-stream' protocol between the pseudo-terminal parts and ldterm.
By re-implementation, we mean re-programming and altering the structure of an existing system where necessary, so as to better implement the function of that system. We distinguish re-implementation from two other activities. These are the transportation of an existing system to new hardware, and the design of a new system. If we transport a system, we leave its function and its structure unaltered. Re-implementation is not merely a combination of transportation and re-design, but is a unique activity in its own right. It is doing it again, but very well this time. [She77] [my italics]
Re-implementation treats the old version of the system as a prototype of the desired system. Functions may be redistributed, added, or dropped. At no point, however, is the intent to provide a base for a completely different system: all the mechanisms are honed to implement precisely the key ideas of the original system.
For example, Andrew Hume's mk[Hum87] is a very good re-implementation of make: the underlying model and syntax of the two programs are obviously closely related, but in the new program the old ideas have been carefully refined, and it interacts more smoothly with the surrounding environment. In contrast, both Mach [Acc86] and Amoeba [Mul89] are new systems that can emulate the UNIX environment well enough to run UNIX programs, but the UNIX interface is not necessarily the best way to use them.
I think the essential ideas of UNIX are still worthwhile, but I also think it will take a good re-implementation to bring new life to them.
It is dangerous to place too much hope in any improvement coming from just following new fashions, if we lack insight into what really went wrong before. Without that insight, I suspect that rewriting UNIX in C++, for example, could easily become an excuse for increasing complexity (because by using C++ `we can handle more complexity').
The system we use has been put together using a traditional UNIX approach: make a careful selection of mechanisms, and if possible, steal good ideas from elsewhere, but ensure they work together and fit the underlying system structure. The result is a more powerful system which is less greedy for memory (and my time). It still leaves room for Thompson's ``much-needed gap'' [Tho75].
Whysall and several others in the Department have been `friendly users' of this software, made many good suggestions, and have given much encouragement. Dennis Ritchie very kindly sent a copy of the Ninth Edition manual.