aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2010-08-26Fix init ordering of /dev/console vs callers of modprobeDavid Howells
commit 31d1d48e199e99077fb30f6fb9a793be7bec756f upstream. Make /dev/console get initialised before any initialisation routine that invokes modprobe because if modprobe fails, it's going to want to open /dev/console, presumably to write an error message to. The problem with that is that if the /dev/console driver is not yet initialised, the chardev handler will call request_module() to invoke modprobe, which will fail, because we never compile /dev/console as a module. This will lead to a modprobe loop, showing the following in the kernel log: request_module: runaway loop modprobe char-major-5-1 request_module: runaway loop modprobe char-major-5-1 request_module: runaway loop modprobe char-major-5-1 request_module: runaway loop modprobe char-major-5-1 request_module: runaway loop modprobe char-major-5-1 This can happen, for example, when the built in md5 module can't find the built in cryptomgr module (because the latter fails to initialise). The md5 module comes before the call to tty_init(), presumably because 'crypto' comes before 'drivers' alphabetically. Fix this by calling tty_init() from chrdev_init(). Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26NFS: Fix an Oops in the NFSv4 atomic open codeTrond Myklebust
commit 0a377cff9428af2da2b293d11e07bc4dbf064ee5 upstream. Adam Lackorzynski reports: with 2.6.35.2 I'm getting this reproducible Oops: [ 110.825396] BUG: unable to handle kernel NULL pointer dereference at (null) [ 110.828638] IP: [<ffffffff811247b7>] encode_attrs+0x1a/0x2a4 [ 110.828638] PGD be89f067 PUD bf18f067 PMD 0 [ 110.828638] Oops: 0000 [#1] SMP [ 110.828638] last sysfs file: /sys/class/net/lo/operstate [ 110.828638] CPU 2 [ 110.828638] Modules linked in: rtc_cmos rtc_core rtc_lib amd64_edac_mod i2c_amd756 edac_core i2c_core dm_mirror dm_region_hash dm_log dm_snapshot sg sr_mod usb_storage ohci_hcd mptspi tg3 mptscsih mptbase usbcore nls_base [last unloaded: scsi_wait_scan] [ 110.828638] [ 110.828638] Pid: 11264, comm: setchecksum Not tainted 2.6.35.2 #1 [ 110.828638] RIP: 0010:[<ffffffff811247b7>] [<ffffffff811247b7>] encode_attrs+0x1a/0x2a4 [ 110.828638] RSP: 0000:ffff88003bf5b878 EFLAGS: 00010296 [ 110.828638] RAX: ffff8800bddb48a8 RBX: ffff88003bf5bb18 RCX: 0000000000000000 [ 110.828638] RDX: ffff8800be258800 RSI: 0000000000000000 RDI: ffff88003bf5b9f8 [ 110.828638] RBP: 0000000000000000 R08: ffff8800bddb48a8 R09: 0000000000000004 [ 110.828638] R10: 0000000000000003 R11: ffff8800be779000 R12: ffff8800be258800 [ 110.828638] R13: ffff88003bf5b9f8 R14: ffff88003bf5bb20 R15: ffff8800be258800 [ 110.828638] FS: 0000000000000000(0000) GS:ffff880041e00000(0063) knlGS:00000000556bd6b0 [ 110.828638] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b [ 110.828638] CR2: 0000000000000000 CR3: 00000000be8ef000 CR4: 00000000000006e0 [ 110.828638] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 110.828638] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 110.828638] Process setchecksum (pid: 11264, threadinfo ffff88003bf5a000, task ffff88003f232210) [ 110.828638] Stack: [ 110.828638] 0000000000000000 ffff8800bfbcf920 0000000000000000 0000000000000ffe [ 110.828638] <0> 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 110.828638] <0> 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 110.828638] Call Trace: [ 110.828638] [<ffffffff81124c1f>] ? nfs4_xdr_enc_setattr+0x90/0xb4 [ 110.828638] [<ffffffff81371161>] ? call_transmit+0x1c3/0x24a [ 110.828638] [<ffffffff813774d9>] ? __rpc_execute+0x78/0x22a [ 110.828638] [<ffffffff81371a91>] ? rpc_run_task+0x21/0x2b [ 110.828638] [<ffffffff81371b7e>] ? rpc_call_sync+0x3d/0x5d [ 110.828638] [<ffffffff8111e284>] ? _nfs4_do_setattr+0x11b/0x147 [ 110.828638] [<ffffffff81109466>] ? nfs_init_locked+0x0/0x32 [ 110.828638] [<ffffffff810ac521>] ? ifind+0x4e/0x90 [ 110.828638] [<ffffffff8111e2fb>] ? nfs4_do_setattr+0x4b/0x6e [ 110.828638] [<ffffffff8111e634>] ? nfs4_do_open+0x291/0x3a6 [ 110.828638] [<ffffffff8111ed81>] ? nfs4_open_revalidate+0x63/0x14a [ 110.828638] [<ffffffff811056c4>] ? nfs_open_revalidate+0xd7/0x161 [ 110.828638] [<ffffffff810a2de4>] ? do_lookup+0x1a4/0x201 [ 110.828638] [<ffffffff810a4733>] ? link_path_walk+0x6a/0x9d5 [ 110.828638] [<ffffffff810a42b6>] ? do_last+0x17b/0x58e [ 110.828638] [<ffffffff810a5fbe>] ? do_filp_open+0x1bd/0x56e [ 110.828638] [<ffffffff811cd5e0>] ? _atomic_dec_and_lock+0x30/0x48 [ 110.828638] [<ffffffff810a9b1b>] ? dput+0x37/0x152 [ 110.828638] [<ffffffff810ae063>] ? alloc_fd+0x69/0x10a [ 110.828638] [<ffffffff81099f39>] ? do_sys_open+0x56/0x100 [ 110.828638] [<ffffffff81027a22>] ? ia32_sysret+0x0/0x5 [ 110.828638] Code: 83 f1 01 e8 f5 ca ff ff 48 83 c4 50 5b 5d 41 5c c3 41 57 41 56 41 55 49 89 fd 41 54 49 89 d4 55 48 89 f5 53 48 81 ec 18 01 00 00 <8b> 06 89 c2 83 e2 08 83 fa 01 19 db 83 e3 f8 83 c3 18 a8 01 8d [ 110.828638] RIP [<ffffffff811247b7>] encode_attrs+0x1a/0x2a4 [ 110.828638] RSP <ffff88003bf5b878> [ 110.828638] CR2: 0000000000000000 [ 112.840396] ---[ end trace 95282e83fd77358f ]--- We need to ensure that the O_EXCL flag is turned off if the user doesn't set O_CREAT. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26nfs: Add "lookupcache" to displayed mount optionsPatrick J. LoPresti
commit 9b00c64318cc337846a7a08a5678f5f19aeff188 upstream. Running "cat /proc/mounts" fails to display the "lookupcache" option. This oversight cost me a bunch of wasted time recently. The following simple patch fixes it. Signed-off-by: Patrick LoPresti <lopresti@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26Fix the nested PR lock calling issue in ACLJiaju Zhang
commit 845b6cf34150100deb5f58c8a37a372b111f2918 upstream. Hi, Thanks a lot for all the review and comments so far;) I'd like to send the improved (V4) version of this patch. This patch fixes a deadlock in OCFS2 ACL. We found this bug in OCFS2 and Samba integration using scenario, the symptom is several smbd processes will be hung under heavy workload. Finally we found out it is the nested PR lock calling that leads to this deadlock: node1 node2 gr PR | V PR(EX)---> BAST:OCFS2_LOCK_BLOCKED | V rq PR | V wait=1 After requesting the 2nd PR lock, the process "smbd" went into D state. It can only be woken up when the 1st PR lock's RO holder equals zero. There should be an ocfs2_inode_unlock in the calling path later on, which can decrement the RO holder. But since it has been in uninterruptible sleep, the unlock function has no chance to be called. The related stack trace is: smbd D ffff8800013d0600 0 9522 5608 0x00000000 ffff88002ca7fb18 0000000000000282 ffff88002f964500 ffff88002ca7fa98 ffff8800013d0600 ffff88002ca7fae0 ffff88002f964340 ffff88002f964340 ffff88002ca7ffd8 ffff88002ca7ffd8 ffff88002f964340 ffff88002f964340 Call Trace: [<ffffffff80350425>] schedule_timeout+0x175/0x210 [<ffffffff8034f580>] wait_for_common+0xf0/0x210 [<ffffffffa03e12b9>] __ocfs2_cluster_lock+0x3b9/0xa90 [ocfs2] [<ffffffffa03e7665>] ocfs2_inode_lock_full_nested+0x255/0xdb0 [ocfs2] [<ffffffffa0446019>] ocfs2_get_acl+0x69/0x120 [ocfs2] [<ffffffffa0446368>] ocfs2_check_acl+0x28/0x80 [ocfs2] [<ffffffff800e3507>] acl_permission_check+0x57/0xb0 [<ffffffff800e357d>] generic_permission+0x1d/0xc0 [<ffffffffa03eecea>] ocfs2_permission+0x10a/0x1d0 [ocfs2] [<ffffffff800e3f65>] inode_permission+0x45/0x100 [<ffffffff800d86b3>] sys_chdir+0x53/0x90 [<ffffffff80007458>] system_call_fastpath+0x16/0x1b [<00007f34a4ef6927>] 0x7f34a4ef6927 For details, please see: https://bugzilla.novell.com/show_bug.cgi?id=614332 and http://oss.oracle.com/bugzilla/show_bug.cgi?id=1278 Signed-off-by: Jiaju Zhang <jjzhang@suse.de> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26nilfs2: fix list corruption after ifile creation failureRyusuke Konishi
commit af4e36318edb848fcc0a8d5f75000ca00cdc7595 upstream. If nilfs_attach_checkpoint() gets a memory allocation failure during creation of ifile, it will return without removing nilfs_sb_info struct from ns_supers list. When a concurrently mounted snapshot is unmounted or another new snapshot is mounted after that, this causes kernel oops as below: > BUG: unable to handle kernel NULL pointer dereference at (null) > IP: [<f83662ff>] nilfs_find_sbinfo+0x74/0xa4 [nilfs2] > *pde = 00000000 > Oops: 0000 [#1] SMP <snip> > Call Trace: > [<f835dc29>] ? nilfs_get_sb+0x165/0x532 [nilfs2] > [<c1173c87>] ? ida_get_new_above+0x16d/0x187 > [<c109a7f8>] ? alloc_vfsmnt+0x7e/0x10a > [<c1070790>] ? kstrdup+0x2c/0x40 > [<c1089041>] ? vfs_kern_mount+0x96/0x14e > [<c108913d>] ? do_kern_mount+0x32/0xbd > [<c109b331>] ? do_mount+0x642/0x6a1 > [<c101a415>] ? do_page_fault+0x0/0x2d1 > [<c1099c00>] ? copy_mount_options+0x80/0xe2 > [<c10705d8>] ? strndup_user+0x48/0x67 > [<c109b3f1>] ? sys_mount+0x61/0x90 > [<c10027cc>] ? sysenter_do_call+0x12/0x22 This fixes the problem. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26ocfs2/dlm: remove potential deadlock -V3Wengang Wang
commit b11f1f1ab73fd358b1b734a9427744802202ba68 upstream. When we need to take both dlm_domain_lock and dlm->spinlock, we should take them in order of: dlm_domain_lock then dlm->spinlock. There is pathes disobey this order. That is calling dlm_lockres_put() with dlm->spinlock held in dlm_run_purge_list. dlm_lockres_put() calls dlm_put() at the ref and dlm_put() locks on dlm_domain_lock. Fix: Don't grab/put the dlm when the initialising/releasing lockres. That grab is not required because we don't call dlm_unregister_domain() based on refcount. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26ocfs2/dlm: avoid incorrect bit set in refmap on recovery masterWengang Wang
commit a524812b7eaa7783d7811198921100f079034e61 upstream. In the following situation, there remains an incorrect bit in refmap on the recovery master. Finally the recovery master will fail at purging the lockres due to the incorrect bit in refmap. 1) node A has no interest on lockres A any longer, so it is purging it. 2) the owner of lockres A is node B, so node A is sending de-ref message to node B. 3) at this time, node B crashed. node C becomes the recovery master. it recovers lockres A(because the master is the dead node B). 4) node A migrated lockres A to node C with a refbit there. 5) node A failed to send de-ref message to node B because it crashed. The failure is ignored. no other action is done for lockres A any more. For mormal, re-send the deref message to it to recovery master can fix it. Well, ignoring the failure of deref to the original master and not recovering the lockres to recovery master has the same effect. And the later is simpler. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26ocfs2: Count more refcount records in file system fragmentation.Tao Ma
commit 8a2e70c40ff58f82dde67770e6623ca45f0cb0c8 upstream. The refcount record calculation in ocfs2_calc_refcount_meta_credits is too optimistic that we can always allocate contiguous clusters and handle an already existed refcount rec as a whole. Actually because of file system fragmentation, we may have the chance to split a refcount record into 3 parts during the transaction. So consider the worst case in record calculation. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26ocfs2 fix o2dlm dlm run purgelist (rev 3)Srinivas Eeda
commit 7beaf243787f85a2ef9213ccf13ab4a243283fde upstream. This patch fixes two problems in dlm_run_purgelist 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge the same lockres instead of trying the next lockres. 2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres. spinlock is reacquired but in this window lockres can get reused. This leads to BUG. This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the lockres spinlock protecting it from getting reused. Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Acked-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26ocfs2/dlm: fix a dead lockWengang Wang
commit 6d98c3ccb52f692f1a60339dde7c700686a5568b upstream. When we have to take both dlm->master_lock and lockres->spinlock, take them in order lockres->spinlock and then dlm->master_lock. The patch fixes a violation of the rule. We can simply move taking dlm->master_lock to where we have dropped res->spinlock since when we access res->state and free mle memory we don't need master_lock's protection. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26ocfs2: do not overwrite error codes in ocfs2_init_aclTiger Yang
commit 6eda3dd33f8a0ce58ee56a11351758643a698db4 upstream. Setting the acl while creating a new inode depends on the error codes of posix_acl_create_masq. This patch fix a issue of overwriting the error codes of it. Reported-by: Pawel Zawora <pzawora@gmail.com> Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-20mm: fix up some user-visible effects of the stack guard pageLinus Torvalds
commit d7824370e26325c881b665350ce64fb0a4fde24a upstream. This commit makes the stack guard page somewhat less visible to user space. It does this by: - not showing the guard page in /proc/<pid>/maps It looks like lvm-tools will actually read /proc/self/maps to figure out where all its mappings are, and effectively do a specialized "mlockall()" in user space. By not showing the guard page as part of the mapping (by just adding PAGE_SIZE to the start for grows-up pages), lvm-tools ends up not being aware of it. - by also teaching the _real_ mlock() functionality not to try to lock the guard page. That would just expand the mapping down to create a new guard page, so there really is no point in trying to lock it in place. It would perhaps be nice to show the guard page specially in /proc/<pid>/maps (or at least mark grow-down segments some way), but let's not open ourselves up to more breakage by user space from programs that depends on the exact deails of the 'maps' file. Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools source code to see what was going on with the whole new warning. Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13ext4: fix freeze deadlock under IOEric Sandeen
commit 437f88cc031ffe7f37f3e705367f4fe1f4be8b0f upstream. Commit 6b0310fbf087ad6 caused a regression resulting in deadlocks when freezing a filesystem which had active IO; the vfs_check_frozen level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing through. Duh. Changing the test to FREEZE_TRANS should let the normal freeze syncing get through the fs, but still block any transactions from starting once the fs is completely frozen. I tested this by running fsstress in the background while periodically snapshotting the fs and running fsck on the result. I ran into occasional deadlocks, but different ones. I think this is a fine fix for the problem at hand, and the other deadlocky things will need more investigation. Reported-by: Phillip Susi <psusi@cfl.rr.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13jfs: don't allow os2 xattr namespace overlap with othersDave Kleikamp
commit aca0fa34bdaba39bfddddba8ca70dba4782e8fe6 upstream. It's currently possible to bypass xattr namespace access rules by prefixing valid xattr names with "os2.", since the os2 namespace stores extended attributes in a legacy format with no prefix. This patch adds checking to deny access to any valid namespace prefix following "os2.". Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Reported-by: Sergey Vlasov <vsu@altlinux.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13signalfd: fill in ssi_int for posix timers and message queuesNathan Lynch
commit a2a20c412c86e0bb46a9ab0dd31bcfe6d201b913 upstream. If signalfd is used to consume a signal generated by a POSIX interval timer or POSIX message queue, the ssi_int field does not reflect the data (sigevent->sigev_value) supplied to timer_create(2) or mq_notify(3). (The ssi_ptr field, however, is filled in.) This behavior differs from signalfd's treatment of sigqueue-generated signals -- see the default case in signalfd_copyinfo. It also gives results that differ from the case when a signal is handled conventionally via a sigaction-registered handler. So, set signalfd_siginfo->ssi_int in the remaining cases (__SI_TIMER, __SI_MESGQ) where ssi_ptr is set. akpm: a non-back-compatible change. Merge into -stable to minimise the number of kernels which are in the field and which miss this feature. Signed-off-by: Nathan Lynch <ntl@pobox.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13fs/ecryptfs/file.c: introduce missing freeJulia Lawall
commit ceeab92971e8af05c1e81a4ff2c271124b55bb9b upstream. The comments in the code indicate that file_info should be released if the function fails. This releasing is done at the label out_free, not out. The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; statement S; expression E; identifier f,f1,l; position p1,p2; expression *ptr != NULL; @@ x@p1 = kmem_cache_zalloc(...); ... if (x == NULL) S <... when != x when != if (...) { <+...x...+> } ( x->f1 = E | (x->f1 == NULL || ...) | f(...,x->f1,...) ) ...> ( return <+...x...+>; | return@p2 ...; ) @script:python@ p1 << r.p1; p2 << r.p2; @@ print "* file: %s kmem_cache_zalloc %s" % (p1[0].file,p1[0].line) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13ecryptfs: release reference to lower mount if interpose failsLino Sanfilippo
commit 31f73bee3e170b7cabb35db9e2f4bf7919b9d036 upstream. In ecryptfs_lookup_and_interpose_lower() the lower mount is not decremented if allocation of a dentry info struct failed. As a result the lower filesystem cant be unmounted any more (since it is considered busy). This patch corrects the reference counting. Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13eCryptfs: Handle ioctl calls with unlocked and compat functionsTyler Hicks
commit c43f7b8fb03be8bcc579bfc4e6ab70eac887ab55 upstream. Lower filesystems that only implemented unlocked_ioctl weren't being passed ioctl calls because eCryptfs only checked for lower_file->f_op->ioctl and returned -ENOTTY if it was NULL. eCryptfs shouldn't implement ioctl(), since it doesn't require the BKL. This patch introduces ecryptfs_unlocked_ioctl() and ecryptfs_compat_ioctl(), which passes the calls on to the lower file system. https://bugs.launchpad.net/ecryptfs/+bug/469664 Reported-by: James Dupin <james.dupin@gmail.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13blkdev: cgroup whitelist permission fixChris Wright
commit b7300b78d1a87625975a799a109a2f98d77757c8 upstream. The cgroup device whitelist code gets confused when trying to grant permission to a disk partition that is not currently open. Part of blkdev_open() includes __blkdev_get() on the whole disk. Basically, the only ways to reliably allow a cgroup access to a partition on a block device when using the whitelist are to 1) also give it access to the whole block device or 2) make sure the partition is already open in a different context. The patch avoids the cgroup check for the whole disk case when opening a partition. Addresses https://bugzilla.redhat.com/show_bug.cgi?id=589662 Signed-off-by: Chris Wright <chrisw@sous-sol.org> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> Reported-by: Vivek Goyal <vgoyal@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: "Daniel P. Berrange" <berrange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13Fix sget() race with failing mountAl Viro
commit 7a4dec53897ecd3367efb1e12fe8a4edc47dc0e9 upstream. If sget() finds a matching superblock being set up, it'll grab an active reference to it and grab s_umount. That's fine - we'll wait for completion of foofs_get_sb() that way. However, if said foofs_get_sb() fails we'll end up holding the halfway-created superblock. deactivate_locked_super() called by foofs_get_sb() will just unlock the sucker since we are holding another active reference to it. What we need is a way to tell if superblock has been successfully set up. Unfortunately, neither ->s_root nor the check for MS_ACTIVE quite fit. Cheap and easy way, suitable for backport: new flag set by the (only) caller of ->get_sb(). If that flag isn't present by the time sget() grabbed s_umount on preexisting superblock it has found, it's seeing a stillborn and should just bury it with deactivate_locked_super() (and repeat the search). Longer term we want to set that flag in ->get_sb() instances (and check for it to distinguish between "sget() found us a live sb" and "sget() has allocated an sb, we need to set it up" in there, instead of checking ->s_root as we do now). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-10block_dev: always serialize exclusive open attemptsTejun Heo
commit e75aa85892b2ee78c79edac720868cbef16e62eb upstream. bd_prepare_to_claim() incorrectly allowed multiple attempts for exclusive open to progress in parallel if the attempting holders are identical. This triggered BUG_ON() as reported in the following bug. https://bugzilla.kernel.org/show_bug.cgi?id=16393 __bd_abort_claiming() is used to finish claiming blocks and doesn't work if multiple openers are inside a claiming block. Allowing multiple parallel open attempts to continue doesn't gain anything as those are serialized down in the call chain anyway. Fix it by always allowing only single open attempt in a claiming block. This problem can easily be reproduced by adding a delay after bd_prepare_to_claim() and attempting to mount two partitions of a disk. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-01NFS: Fix a typo in include/linux/nfs_fs.hTrond Myklebust
nfs_commit_inode() needs to be defined irrespectively of whether or not we are supporting NFSv3 and NFSv4. Allow the compiler to optimise away code in the NFSv2-only case by converting it into an inlined stub function. Reported-and-tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-30Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Ensure that writepage respects the nonblock flag NFS: kswapd must not block in nfs_release_page nfs: include space for the NUL in root path
2010-07-30CIFS: Remove __exit mark from cifs_exit_dns_resolver()David Howells
Remove the __exit mark from cifs_exit_dns_resolver() as it's called by the module init routine in case of error, and so may have been discarded during linkage. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-30NFS: Ensure that writepage respects the nonblock flagTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30NFS: kswapd must not block in nfs_release_pageTrond Myklebust
See https://bugzilla.kernel.org/show_bug.cgi?id=16056 If other processes are blocked waiting for kswapd to free up some memory so that they can make progress, then we cannot allow kswapd to block on those processes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-07-30nfs: include space for the NUL in root pathDan Carpenter
In root_nfs_name() it does the following: if (strlen(buf) + strlen(cp) > NFS_MAXPATHLEN) { printk(KERN_ERR "Root-NFS: Pathname for remote directory too long.\n"); return -1; } sprintf(nfs_export_path, buf, cp); In the original code if (strlen(buf) + strlen(cp) == NFS_MAXPATHLEN) then the sprintf() would lead to an overflow. Generally the rest of the code assumes that the path can have NFS_MAXPATHLEN (1024) characters and a NUL terminator so the fix is to add space to the nfs_export_path[] buffer. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-29CRED: Fix get_task_cred() and task_state() to not resurrect dead credentialsDavid Howells
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a tasks credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001 FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0) Stack: ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45 <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000 <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246 Call Trace: [<ffffffff810698cd>] put_cred+0x13/0x15 [<ffffffff81069b45>] commit_creds+0x16b/0x175 [<ffffffff8106aace>] set_current_groups+0x47/0x4e [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00 48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b 04 25 00 cc 00 00 48 3b b8 58 04 00 00 75 RIP [<ffffffff81069881>] __put_cred+0xc/0x45 RSP <ffff88019e7e9eb8> ---[ end trace df391256a100ebdd ]--- Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-28ecryptfs: Bugfix for error related to ecryptfs_hash_bucketsAndre Osterhues
The function ecryptfs_uid_hash wrongly assumes that the second parameter to hash_long() is the number of hash buckets instead of the number of hash bits. This patch fixes that and renames the variable ecryptfs_hash_buckets to ecryptfs_hash_bits to make it clearer. Fixes: CVE-2010-2492 Signed-off-by: Andre Osterhues <aosterhues@escrypt.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: use complete_all and wake_up_all ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES" ceph: fix dentry lease release ceph: fix leak of dentry in ceph_init_dentry() error path ceph: fix pg_mapping leak on pg_temp updates ceph: fix d_release dop for snapdir, snapped dentries ceph: avoid dcache readdir for snapdir
2010-07-28GFS2: Use kmalloc when possible for ->readdir()Steven Whitehouse
If we don't need a huge amount of memory in ->readdir() then we can use kmalloc rather than vmalloc to allocate it. This should cut down on the greater overheads associated with vmalloc for smaller directories. We may be able to eliminate vmalloc entirely at some stage, but this is easy to do right away. Also using GFP_NOFS to avoid any issues wrt to deleting inodes while under a glock, and suggestion from Linus to factor out the alloc/dealloc. I've given this a test with a variety of different sized directories and it seems to work ok. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-27ceph: use complete_all and wake_up_allYehuda Sadeh
This fixes an issue triggered by running concurrent syncs. One of the syncs would go through while the other would just hang indefinitely. In any case, we never actually want to wake a single waiter, so the *_all functions should be used. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-279p: Pass the correct end of buffer to p9stat_readLatchesar Ionkov
Pass the correct end of the buffer to p9stat_read. Signed-off-by: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-07-26sysfs: allow creating symlinks from untagged to tagged directoriesEric W. Biederman
Supporting symlinks from untagged to tagged directories is reasonable, and needed to support CONFIG_SYSFS_DEPRECATED. So don't fail a prior allowing that case to work. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26sysfs: sysfs_delete_link handle symlinks from untagged to tagged directories.Eric W. Biederman
This happens for network devices when SYSFS_DEPRECATED is enabled. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26sysfs: Don't allow the creation of symlinks we can't removeEric W. Biederman
Recently my tagged sysfs support revealed a flaw in the device core that a few rare drivers are running into such that we don't always put network devices in a class subdirectory named net/. Since we are not creating the class directory the network devices wind up in a non-tagged directory, but the symlinks to the network devices from /sys/class/net are in a tagged directory. All of which works until we go to remove or rename the symlink. When we remove or rename a symlink we look in the namespace of the target of the symlink. Since the target of the symlink is in a non-tagged sysfs directory we don't have a namespace to look in, and we fail to remove the symlink. Detect this problem up front and simply don't create symlinks we won't be able to remove later. This prevents symlink leakage and fails in a much clearer and more understandable way. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Maciej W. Rozycki <macro@linux-mips.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-24ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"Robert P. J. Day
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: fix dentry lease releaseSage Weil
When we embed a dentry lease release notification in a request, invalidate our lease so we don't think we still have it. Otherwise we can get all sorts of incorrect client behavior when multiple clients are interacting with the same part of the namespace. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: fix leak of dentry in ceph_init_dentry() error pathSage Weil
If we fail to allocate a ceph_dentry_info, don't leak the dn reference. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: fix pg_mapping leak on pg_temp updatesSage Weil
Free the ceph_pg_mapping structs when they are removed from the pg_temp rbtree. Also fix a leak in the __insert_pg_mapping() error path. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: fix d_release dop for snapdir, snapped dentriesSage Weil
We need to set the d_release dop for snapdir and snapped dentries so that the ceph_dentry_info struct gets released. We also use the dcache to cache readdir results when possible, which only works if we know when dentries are dropped from the cache. Since we don't use the dcache for readdir in the hidden snapdir, avoid that case in ceph_dentry_release. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-22ceph: avoid dcache readdir for snapdirSage Weil
We should always go to the MDS for readdir on the hidden snapdir. The set of snapshots can change at any time; the client can't trust its cache for that. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-22CIFS: Fix a malicious redirect problem in the DNS lookup codeDavid Howells
Fix the security problem in the CIFS filesystem DNS lookup code in which a malicious redirect could be installed by a random user by simply adding a result record into one of their keyrings with add_key() and then invoking a CIFS CFS lookup [CVE-2010-2524]. This is done by creating an internal keyring specifically for the caching of DNS lookups. To enforce the use of this keyring, the module init routine creates a set of override credentials with the keyring installed as the thread keyring and instructs request_key() to only install lookup result keys in that keyring. The override is then applied around the call to request_key(). This has some additional benefits when a kernel service uses this module to request a key: (1) The result keys are owned by root, not the user that caused the lookup. (2) The result keys don't pop up in the user's keyrings. (3) The result keys don't come out of the quota of the user that caused the lookup. The keyring can be viewed as root by doing cat /proc/keys: 2a0ca6c3 I----- 1 perm 1f030000 0 0 keyring .dns_resolver: 1/4 It can then be listed with 'keyctl list' by root. # keyctl list 0x2a0ca6c3 1 key in keyring: 726766307: --alswrv 0 0 dns_resolver: foo.bar.com Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve French <smfrench@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-21Fix up trivial spelling errors ('taht' -> 'that')Linus Torvalds
Pointed out by Lucas who found the new one in a comment in setup_percpu.c. And then I fixed the others that I grepped for. Reported-by: Lucas <canolucas@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: do not include cap/dentry releases in replayed messages ceph: reuse request message when replaying against recovering mds ceph: fix creation of ipv6 sockets ceph: fix parsing of ipv6 addresses ceph: fix printing of ipv6 addrs ceph: add kfree() to error path ceph: fix leak of mon authorizer ceph: fix message revocation
2010-07-19Merge branch 'shrinker' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev * 'shrinker' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev: xfs: track AGs with reclaimable inodes in per-ag radix tree xfs: convert inode shrinker to per-filesystem contexts mm: add context argument to shrinker callback
2010-07-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE Btrfs: fix CLONE ioctl destination file size expansion to block boundary Btrfs: fix split_leaf double split corner case
2010-07-20xfs: track AGs with reclaimable inodes in per-ag radix treeDave Chinner
https://bugzilla.kernel.org/show_bug.cgi?id=16348 When the filesystem grows to a large number of allocation groups, the summing of recalimable inodes gets expensive. In many cases, most AGs won't have any reclaimable inodes and so we are wasting CPU time aggregating over these AGs. This is particularly important for the inode shrinker that gets called frequently under memory pressure. To avoid the overhead, track AGs with reclaimable inodes in the per-ag radix tree so that we can find all the AGs with reclaimable inodes via a simple gang tag lookup. This involves setting the tag when the first reclaimable inode is tracked in the AG, and removing the tag when the last reclaimable inode is removed from the tree. Then the summation process becomes a loop walking the radix tree summing AGs with the reclaim tag set. This significantly reduces the overhead of scanning - a 6400 AG filesystea now only uses about 25% of a cpu in kswapd while slab reclaim progresses instead of being permanently stuck at 100% CPU and making little progress. Clean filesystems filesystems will see no overhead and the overhead only increases linearly with the number of dirty AGs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-20xfs: convert inode shrinker to per-filesystem contextsDave Chinner
Now the shrinker passes us a context, wire up a shrinker context per filesystem. This allows us to remove the global mount list and the locking problems that introduced. It also means that a shrinker call does not need to traverse clean filesystems before finding a filesystem with reclaimable inodes. This significantly reduces scanning overhead when lots of filesystems are present. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-19Btrfs: fix checks in BTRFS_IOC_CLONE_RANGEDan Rosenberg
1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check whether the donor file is append-only before writing to it. 2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer overflow that allows a user to specify an out-of-bounds range to copy from the source file (if off + len wraps around). I haven't been able to successfully exploit this, but I'd imagine that a clever attacker could use this to read things he shouldn't. Even if it's not exploitable, it couldn't hurt to be safe. Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> cc: stable@kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>