Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/00-INDEX | 4
-rw-r--r-- | Documentation/filesystems/Locking | 9
-rw-r--r-- | Documentation/filesystems/locks.txt | 67
-rw-r--r-- | Documentation/filesystems/mandatory-locking.txt | 171
-rw-r--r-- | Documentation/filesystems/ntfs.txt | 4
-rw-r--r-- | Documentation/filesystems/vfs.txt | 51
6 files changed, 297 insertions, 9 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 59db1bca7027..599593a17067 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -52,6 +52,10 @@ isofs.txt
 	- info and mount options for the ISO 9660 (CDROM) filesystem.
 jfs.txt
 	- info and mount options for the JFS filesystem.
+locks.txt
+	- info on file locking implementations, flock() vs. fcntl(), etc.
+mandatory-locking.txt
+	- info on the Linux implementation of Sys V mandatory file locking.
 ncpfs.txt
 	- info on Novell Netware(tm) filesystem using NCP protocol.
 ntfs.txt
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f0f825808ca4..fe26cc978523 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -178,15 +178,18 @@ prototypes:
 locking rules:
 	All except set_page_dirty may block
 
-			BKL	PageLocked(page)
+			BKL	PageLocked(page)	i_sem
 writepage:		no	yes, unlocks (see below)
 readpage:		no	yes, unlocks
 sync_page:		no	maybe
 writepages:		no
 set_page_dirty		no	no
 readpages:		no
-prepare_write:		no	yes
-commit_write:		no	yes
+prepare_write:		no	yes			yes
+commit_write:		no	yes			yes
+write_begin:		no	locks the page		yes
+write_end:		no	yes, unlocks		yes
+perform_write:		no	n/a			yes
 bmap:			yes
 invalidatepage:		no	yes
 releasepage:		no	yes
diff --git a/Documentation/filesystems/locks.txt b/Documentation/filesystems/locks.txt
new file mode 100644
index 000000000000..fab857accbd6
--- /dev/null
+++ b/Documentation/filesystems/locks.txt
@@ -0,0 +1,67 @@
+	File Locking Release Notes
+
+	Andy Walker <andy@lysaker.kvaerner.no>
+
+	12 May 1997
+
+
+1. What's New?
+--------------
+
+1.1 Broken Flock Emulation
+--------------------------
+
+The old flock(2) emulation in the kernel was swapped for proper BSD
+compatible flock(2) support in the 1.3.x series of kernels. With the
+release of the 2.1.x kernel series, support for the old emulation has
+been totally removed, so that we don't need to carry this baggage
+forever.
+
+This should not cause problems for anybody, since everybody using a
+2.1.x kernel should have updated their C library to a suitable version
+anyway (see the file "Documentation/Changes".)
+
+1.2 Allow Mixed Locks Again
+---------------------------
+
+1.2.1 Typical Problems - Sendmail
+---------------------------------
+Because sendmail was unable to use the old flock() emulation, many sendmail
+installations use fcntl() instead of flock(). This is true of Slackware 3.0
+for example. This gave rise to some other subtle problems if sendmail was
+configured to rebuild the alias file. Sendmail tried to lock the aliases.dir
+file with fcntl() at the same time as the GDBM routines tried to lock this
+file with flock(). With pre 1.3.96 kernels this could result in deadlocks that,
+over time, or under a very heavy mail load, would eventually cause the kernel
+to lock solid with deadlocked processes.
+
+
+1.2.2 The Solution
+------------------
+The solution I have chosen, after much experimentation and discussion,
+is to make flock() and fcntl() locks oblivious to each other. Both can
+exists, and neither will have any effect on the other.
+
+I wanted the two lock styles to be cooperative, but there were so many
+race and deadlock conditions that the current solution was the only
+practical one. It puts us in the same position as, for example, SunOS
+4.1.x and several other commercial Unices. The only OS's that support
+cooperative flock()/fcntl() are those that emulate flock() using
+fcntl(), with all the problems that implies.
+
+
+1.3 Mandatory Locking As A Mount Option
+---------------------------------------
+
+Mandatory locking, as described in 'Documentation/filesystems/mandatory.txt'
+was prior to this release a general configuration option that was valid for
+all mounted filesystems. This had a number of inherent dangers, not the
+least of which was the ability to freeze an NFS server by asking it to read
+a file for which a mandatory lock existed.
+
+From this release of the kernel, mandatory locking can be turned on and off
+on a per-filesystem basis, using the mount options 'mand' and 'nomand'.
+The default is to disallow mandatory locking. The intention is that
+mandatory locking only be enabled on a local filesystem as the specific need
+arises.
+
diff --git a/Documentation/filesystems/mandatory-locking.txt b/Documentation/filesystems/mandatory-locking.txt
new file mode 100644
index 000000000000..0979d1d2ca8b
--- /dev/null
+++ b/Documentation/filesystems/mandatory-locking.txt
@@ -0,0 +1,171 @@
+	Mandatory File Locking For The Linux Operating System
+
+	Andy Walker <andy@lysaker.kvaerner.no>
+
+	15 April 1996
+	(Updated September 2007)
+
+0. Why you should avoid mandatory locking
+-----------------------------------------
+
+The Linux implementation is prey to a number of difficult-to-fix race
+conditions which in practice make it not dependable:
+
+	- The write system call checks for a mandatory lock only once
+	  at its start. It is therefore possible for a lock request to
+	  be granted after this check but before the data is modified.
+	  A process may then see file data change even while a mandatory
+	  lock was held.
+	- Similarly, an exclusive lock may be granted on a file after
+	  the kernel has decided to proceed with a read, but before the
+	  read has actually completed, and the reading process may see
+	  the file data in a state which should not have been visible
+	  to it.
+	- Similar races make the claimed mutual exclusion between lock
+	  and mmap similarly unreliable.
+
+1. What is mandatory locking?
+------------------------------
+
+Mandatory locking is kernel enforced file locking, as opposed to the more usual
+cooperative file locking used to guarantee sequential access to files among
+processes. File locks are applied using the flock() and fcntl() system calls
+(and the lockf() library routine which is a wrapper around fcntl().) It is
+normally a process' responsibility to check for locks on a file it wishes to
+update, before applying its own lock, updating the file and unlocking it again.
+The most commonly used example of this (and in the case of sendmail, the most
+troublesome) is access to a user's mailbox. The mail user agent and the mail
+transfer agent must guard against updating the mailbox at the same time, and
+prevent reading the mailbox while it is being updated.
+
+In a perfect world all processes would use and honour a cooperative, or
+"advisory" locking scheme. However, the world isn't perfect, and there's
+a lot of poorly written code out there.
+
+In trying to address this problem, the designers of System V UNIX came up
+with a "mandatory" locking scheme, whereby the operating system kernel would
+block attempts by a process to write to a file that another process holds a
+"read" -or- "shared" lock on, and block attempts to both read and write to a
+file that a process holds a "write " -or- "exclusive" lock on.
+
+The System V mandatory locking scheme was intended to have as little impact as
+possible on existing user code. The scheme is based on marking individual files
+as candidates for mandatory locking, and using the existing fcntl()/lockf()
+interface for applying locks just as if they were normal, advisory locks.
+
+Note 1: In saying "file" in the paragraphs above I am actually not telling
+the whole truth. System V locking is based on fcntl(). The granularity of
+fcntl() is such that it allows the locking of byte ranges in files, in addition
+to entire files, so the mandatory locking rules also have byte level
+granularity.
+
+Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite
+borrowing the fcntl() locking scheme from System V. The mandatory locking
+scheme is defined by the System V Interface Definition (SVID) Version 3.
+
+2. Marking a file for mandatory locking
+---------------------------------------
+
+A file is marked as a candidate for mandatory locking by setting the group-id
+bit in its file mode but removing the group-execute bit. This is an otherwise
+meaningless combination, and was chosen by the System V implementors so as not
+to break existing user programs.
+
+Note that the group-id bit is usually automatically cleared by the kernel when
+a setgid file is written to. This is a security measure. The kernel has been
+modified to recognize the special case of a mandatory lock candidate and to
+refrain from clearing this bit. Similarly the kernel has been modified not
+to run mandatory lock candidates with setgid privileges.
+
+3. Available implementations
+----------------------------
+
+I have considered the implementations of mandatory locking available with
+SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.
+
+Generally I have tried to make the most sense out of the behaviour exhibited
+by these three reference systems. There are many anomalies.
+
+All the reference systems reject all calls to open() for a file on which
+another process has outstanding mandatory locks. This is in direct
+contravention of SVID 3, which states that only calls to open() with the
+O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
+definition, which is the "Right Thing", since only calls with O_TRUNC can
+modify the contents of the file.
+
+HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
+just mandatory locks. That would appear to contravene POSIX.1.
+
+mmap() is another interesting case. All the operating systems mentioned
+prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
+also disallows advisory locks for such a file. SVID actually specifies the
+paranoid HP-UX behaviour.
+
+In my opinion only MAP_SHARED mappings should be immune from locking, and then
+only from mandatory locks - that is what is currently implemented.
+
+SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
+mandatory locks, so reads and writes to locked files always block when they
+should return EAGAIN.
+
+I'm afraid that this is such an esoteric area that the semantics described
+below are just as valid as any others, so long as the main points seem to
+agree.
+
+4. Semantics
+------------
+
+1. Mandatory locks can only be applied via the fcntl()/lockf() locking
+   interface - in other words the System V/POSIX interface. BSD style
+   locks using flock() never result in a mandatory lock.
+
+2. If a process has locked a region of a file with a mandatory read lock, then
+   other processes are permitted to read from that region. If any of these
+   processes attempts to write to the region it will block until the lock is
+   released, unless the process has opened the file with the O_NONBLOCK
+   flag in which case the system call will return immediately with the error
+   status EAGAIN.
+
+3. If a process has locked a region of a file with a mandatory write lock, all
+   attempts to read or write to that region block until the lock is released,
+   unless a process has opened the file with the O_NONBLOCK flag in which case
+   the system call will return immediately with the error status EAGAIN.
+
+4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has
+   any mandatory locks owned by other processes will be rejected with the
+   error status EAGAIN.
+
+5. Attempts to apply a mandatory lock to a file that is memory mapped and
+   shared (via mmap() with MAP_SHARED) will be rejected with the error status
+   EAGAIN.
+
+6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
+   that has any mandatory locks in effect will be rejected with the error status
+   EAGAIN.
+
+5. Which system calls are affected?
+-----------------------------------
+
+Those which modify a file's contents, not just the inode. That gives read(),
+write(), readv(), writev(), open(), creat(), mmap(), truncate() and
+ftruncate(). truncate() and ftruncate() are considered to be "write" actions
+for the purposes of mandatory locking.
+
+The affected region is usually defined as stretching from the current position
+for the total number of bytes read or written. For the truncate calls it is
+defined as the bytes of a file removed or added (we must also consider bytes
+added, as a lock can specify just "the whole file", rather than a specific
+range of bytes.)
+
+Note 3: I may have overlooked some system calls that need mandatory lock
+checking in my eagerness to get this code out the door. Please let me know, or
+better still fix the system calls yourself and submit a patch to me or Linus.
+
+6. Warning!
+-----------
+
+Not even root can override a mandatory lock, so runaway processes can wreak
+havoc if they lock crucial files. The way around it is to change the file
+permissions (remove the setgid bit) before trying to read or write to it.
+Of course, that might be a bit tricky if the system is hung :-(
+
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt
index 8ee10ec88293..e79ee2db183a 100644
--- a/Documentation/filesystems/ntfs.txt
+++ b/Documentation/filesystems/ntfs.txt
@@ -407,7 +407,7 @@ raiddev /dev/md0
 	device		/dev/hda5
 	raid-disk	0
 	device		/dev/hdb1
-	raid-disl	1
+	raid-disk	1
 
 For linear raid, just change the raid-level above to "raid-level linear", for
 mirrors, change it to "raid-level 1", and for stripe sets with parity, change
@@ -457,6 +457,8 @@ ChangeLog
 
 Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
 
+2.1.29:
+	- Fix a deadlock when mounting read-write.
 2.1.28:
 	- Fix a deadlock.
 2.1.27:
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 045f3e055a28..6f8e16e3d6c0 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -537,6 +537,12 @@ struct address_space_operations {
 			struct list_head *pages, unsigned nr_pages);
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+	int (*write_begin)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+	int (*write_end)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
 	sector_t (*bmap)(struct address_space *, sector_t);
 	int (*invalidatepage) (struct page *, unsigned long);
 	int (*releasepage) (struct page *, int);
@@ -615,11 +621,7 @@ struct address_space_operations {
 	any basic-blocks on storage, then those blocks should be
 	pre-read (if they haven't been read already) so that the
 	updated blocks can be written out properly.
-	The page will be locked. If prepare_write wants to unlock the
-	page it, like readpage, may do so and return
-	AOP_TRUNCATED_PAGE.
-	In this case the prepare_write will be retried one the lock is
-	regained.
+	The page will be locked.
 
 	Note: the page _must not_ be marked uptodate in this function
 	(or anywhere else) unless it actually is uptodate right now. As
@@ -633,6 +635,45 @@ struct address_space_operations {
 	operations. It should avoid returning an error if possible -
 	errors should have been handled by prepare_write.
 
+  write_begin: This is intended as a replacement for prepare_write. The
+	key differences being that:
+		- it returns a locked page (in *pagep) rather than being
+		  given a pre locked page;
+		- it must be able to cope with short writes (where the
+		  length passed to write_begin is greater than the number
+		  of bytes copied into the page).
+
+	Called by the generic buffered write code to ask the filesystem to
+	prepare to write len bytes at the given offset in the file. The
+	address_space should check that the write will be able to complete,
+	by allocating space if necessary and doing any other internal
+	housekeeping. If the write will update parts of any basic-blocks on
+	storage, then those blocks should be pre-read (if they haven't been
+	read already) so that the updated blocks can be written out properly.
+
+	The filesystem must return the locked pagecache page for the specified
+	offset, in *pagep, for the caller to write into.
+
+	flags is a field for AOP_FLAG_xxx flags, described in
+	include/linux/fs.h.
+
+	A void * may be returned in fsdata, which then gets passed into
+	write_end.
+
+	Returns 0 on success; < 0 on failure (which is the error code), in
+	which case write_end is not called.
+
+  write_end: After a successful write_begin, and data copy, write_end must
+	be called. len is the original len passed to write_begin, and copied
+	is the amount that was able to be copied (copied == len is always true
+	if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
+
+	The filesystem must take care of unlocking the page and releasing it
+	refcount, and updating i_size.
+
+	Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
+	that were able to be copied into pagecache.
+
   bmap: called by the VFS to map a logical block offset within object to
 	physical block number. This method is used by the FIBMAP
 	ioctl and for working with swap-files. To be able to swap to
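
Illustration (not part of the patch): the mandatory-locking.txt text added above says a file becomes a candidate for mandatory locking when its setgid bit is set and its group-execute bit is clear, and that locks are then applied through the ordinary fcntl() interface. The user-space sketch below shows that rule under stated assumptions. The helper names is_mandatory_lock_candidate(), mark_mandatory_lock_candidate() and try_write_lock() are hypothetical and do not come from the kernel sources. Whether a lock is actually enforced as mandatory also depends on the filesystem being mounted with 'mand' on a kernel that supports it.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: true when the mode carries the "setgid set,
 * group-execute clear" combination described in section 2 of
 * mandatory-locking.txt.
 */
bool is_mandatory_lock_candidate(mode_t mode)
{
	return (mode & (S_ISGID | S_IXGRP)) == S_ISGID;
}

/*
 * Hypothetical helper: return the mode with setgid added and
 * group-execute removed, suitable for passing to chmod()/fchmod().
 */
mode_t mark_mandatory_lock_candidate(mode_t mode)
{
	return (mode | S_ISGID) & ~(mode_t)S_IXGRP;
}

/*
 * Hypothetical helper: try to take a write lock on the byte range
 * [start, start + len) without blocking, using the fcntl() interface.
 * A len of 0 means "to the end of the file". Returns 0 on success,
 * or -1 with errno set, as fcntl() does.
 */
int try_write_lock(int fd, off_t start, off_t len)
{
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = start,
		.l_len = len,
	};

	return fcntl(fd, F_SETLK, &fl);
}
```

A caller would typically fstat() the file, pass st_mode through mark_mandatory_lock_candidate(), apply the result with fchmod(), and only then take the lock with try_write_lock(). The lock call is the same one an advisory lock would use, which is the point the document makes about leaving the existing interface untouched.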