aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-05-17coredump: accept any write methodAl Viro
commit 86cc05840a0da1afcb6b8151b53f3b606457c91b upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17path_openat(): fix double fput()Al Viro
commit f15133df088ecadd141ea1907f2c96df67c729f0 upstream. path_openat() jumps to the wrong place after do_tmpfile() - it has already done path_cleanup() (as part of path_lookupat() called by do_tmpfile()), so doing that again can lead to double fput(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17mnt: Fix fs_fully_visible to verify the root directory is visibleEric W. Biederman
commit 7e96c1b0e0f495c5a7450dc4aa7c9a24ba4305bd upstream. This fixes a dumb bug in fs_fully_visible that allows proc or sys to be mounted if there is a bind mount of part of /proc/ or /sys/ visible. Reported-by: Eric Windisch <ewindisch@docker.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17nilfs2: fix sanity check of btree level in nilfs_btree_root_broken()Ryusuke Konishi
commit d8fd150fe3935e1692bf57c66691e17409ebb9c1 upstream. The range check for b-tree level parameter in nilfs_btree_root_broken() is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even though the level is limited to values in the range of 0 to (NILFS_BTREE_LEVEL_MAX - 1). Since the level parameter is read from storage device and used to index nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it can cause memory overrun during btree operations if the boundary value is set to the level parameter on device. This fixes the broken sanity check and adds a comment to clarify that the upper bound NILFS_BTREE_LEVEL_MAX is exclusive. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17ocfs2: dlm: fix race between purge and get lock resourceJunxiao Bi
commit b1432a2a35565f538586774a03bf277c27fc267d upstream. There is a race window in dlm_get_lock_resource(), which may return a lock resource which has been purged. This will cause the process to hang forever in dlmlock() as the ast msg can't be handled due to its lock resource not existing. dlm_get_lock_resource { ... spin_lock(&dlm->spinlock); tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash); if (tmpres) { spin_unlock(&dlm->spinlock); >>>>>>>> race window, dlm_run_purge_list() may run and purge the lock resource spin_lock(&tmpres->spinlock); ... spin_unlock(&tmpres->spinlock); } } Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-13hfsplus: don't store special "osx" xattr prefix on-diskThomas Hebb
commit db579e76f06e78de011b2cb7e028740a82f5558c upstream. On Mac OS X, HFS+ extended attributes are not namespaced. Since we want to be compatible with OS X filesystems and yet still support the Linux namespacing system, the hfsplus driver implements a special "osx" namespace that is reported for any attribute that is not namespaced on-disk. However, the current code for getting and setting these unprefixed attributes is broken. hfsplus_osx_setattr() and hfsplus_osx_getattr() are passed names that have already had their "osx." prefixes stripped by the generic functions. The functions first, quite correctly, check those names to make sure that they aren't prefixed with a known namespace, which would allow namespace access restrictions to be bypassed. However, the functions then prepend "osx." to the name they're given before passing it on to hfsplus_getattr() and hfsplus_setattr(). Not only does this cause the "osx." prefix to be stored on-disk, defeating its purpose, it also breaks the check for the special "com.apple.FinderInfo" attribute, which is reported for all files, and as a consequence makes some userspace applications (e.g. GNU patch) fail even when extended attributes are not otherwise in use. There are five commits which have touched this particular code: 127e5f5ae51e ("hfsplus: rework functionality of getting, setting and deleting of extended attributes") b168fff72109 ("hfsplus: use xattr handlers for removexattr") bf29e886b242 ("hfsplus: correct usage of HFSPLUS_ATTR_MAX_STRLEN for non-English attributes") fcacbd95e121 ("fs/hfsplus: move xattr_name allocation in hfsplus_getxattr()") ec1bbd346f18 ("fs/hfsplus: move xattr_name allocation in hfsplus_setxattr()") The first commit creates the functions to begin with. The namespace is prepended by the original code, which I believe was correct at the time, since hfsplus_?etattr() stripped the prefix if found. The second commit removes this behavior from hfsplus_?etattr() and appears to have been intended to also remove the prefixing from hfsplus_osx_?etattr(). However, what it actually does is remove a necessary strncpy() call completely, breaking the osx namespace entirely. The third commit re-adds the strncpy() call as it was originally, but doesn't mention it in its commit message. The final two commits refactor the code and don't affect its functionality. This commit does what b168fff attempted to do (prevent the prefix from being added), but does it properly, instead of passing in an empty buffer (which is what b168fff actually did). Fixes: b168fff72109 ("hfsplus: use xattr handlers for removexattr") Signed-off-by: Thomas Hebb <tommyhebb@gmail.com> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Cc: Sergei Antonov <saproj@gmail.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Hebb <tommyhebb@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-13ext4: move check under lock scope to close a race.Davide Italiano
commit 280227a75b56ab5d35854f3a77ef74a7ad56a203 upstream. fallocate() checks that the file is extent-based and returns EOPNOTSUPP in case is not. Other tasks can convert from and to indirect and extent so it's safe to check only after grabbing the inode mutex. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-13ext4: fix data corruption caused by unwritten and delayed extentsLukas Czerner
commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream. Currently it is possible to lose whole file system block worth of data when we hit the specific interaction with unwritten and delayed extents in status extent tree. The problem is that when we insert delayed extent into extent status tree the only way to get rid of it is when we write out delayed buffer. However there is a limitation in the extent status tree implementation so that when inserting unwritten extent should there be even a single delayed block the whole unwritten extent would be marked as delayed. At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when a we write into said unwritten extent we will convert it to written, but it still remains delayed. When we try to write into that block later ext4_da_map_blocks() will set the buffer new and delayed and map it to invalid block which causes the rest of the block to be zeroed loosing already written data. For now we can fix this by simply not allowing to set delayed status on written extent in the extent status tree. Also add WARN_ON() to make sure that we notice if this happens in the future. This problem can be easily reproduced by running the following xfs_io. xfs_io -f -c "pwrite -S 0xaa 4096 2048" \ -c "falloc 0 131072" \ -c "pwrite -S 0xbb 65536 2048" \ -c "fsync" /mnt/test/fff echo 3 > /proc/sys/vm/drop_caches xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff This can be theoretically also reproduced by at random by running fsx, but it's not very reliable, though on machines with bigger page size (like ppc) this can be seen more often (especially xfstest generic/127) Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-13btrfs: unlock i_mutex after attempting to delete subvolume during sendOmar Sandoval
commit 909e26dce3f7600f5e293ac0522c28790a0c8c9c upstream. Whenever the check for a send in progress introduced in commit 521e0546c970 (btrfs: protect snapshots from deleting during send) is hit, we return without unlocking inode->i_mutex. This is easy to see with lockdep enabled: [ +0.000059] ================================================ [ +0.000028] [ BUG: lock held when returning to user space! ] [ +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted [ +0.000026] ------------------------------------------------ [ +0.000029] btrfs/211 is leaving the kernel with locks still held! [ +0.000029] 1 lock held by btrfs/211: [ +0.000023] #0: (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0 Make sure we unlock it in the error path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06NFS: Add a stub for GETDEVICELISTAnna Schumaker
commit 7c61f0d3897eeeff6f3294adb9f910ddefa8035a upstream. d4b18c3e (pnfs: remove GETDEVICELIST implementation) removed the GETDEVICELIST operation from the NFS client, but left a "hole" in the nfs4_procedures array. This caused /proc/self/mountstats to report an operation named "51" where GETDEVICELIST used to be. This patch adds a stub to fix mountstats. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Fixes: d4b18c3e (pnfs: remove GETDEVICELIST implementation) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06nfs: remove WARN_ON_ONCE from nfs_direct_good_bytesPeng Tao
commit 05f54903d9d370a4cd302a85681304d3ec59e5c1 upstream. For flexfiles driver, we might choose to read from mirror index other than 0 while mirror_count is always 1 for read. Reported-by: Jean Spector <jean@primarydata.com> Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06nfs: fix DIO good bytes calculationPeng Tao
commit 1ccbad9f9f9bd36db26a10f0b17fbaf12b3ae93a upstream. For direct read that has IO size larger than rsize, we'll split it into several READ requests and nfs_direct_good_bytes() would count completed bytes incorrectly by eating last zero count reply. Fix it by handling mirror and non-mirror cases differently such that we only count mirrored writes differently. This fixes 5fadeb47("nfs: count DIO good bytes correctly with mirroring"). Reported-by: Jean Spector <jean@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06nfs: fix high load average due to callback thread sleepingJeff Layton
commit 5d05e54af3cdbb13cf19c557ff2184781b91a22c upstream. Chuck pointed out a problem that crept in with commit 6ffa30d3f734 (nfs: don't call blocking operations while !TASK_RUNNING). Linux counts tasks in uninterruptible sleep against the load average, so this caused the system's load average to be pinned at at least 1 when there was a NFSv4.1+ mount active. Not a huge problem, but it's probably worth fixing before we get too many complaints about it. This patch converts the code back to use TASK_INTERRUPTIBLE sleep, simply has it flush any signals on each loop iteration. In practice no one should really be signalling this thread at all, so I think this is reasonably safe. With this change, there's also no need to game the hung task watchdog so we can also convert the schedule_timeout call back to a normal schedule. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Fixes: commit 6ffa30d3f734 (“nfs: don't call blocking . . .”) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06nfsd: fix nsfd startup race triggering BUG_ONGiuseppe Cantavenera
commit bb7ffbf29e76b89a86ca4c3ee0d4690641f2f772 upstream. nfsd triggered a BUG_ON in net_generic(...) when rpc_pipefs_event(...) in fs/nfsd/nfs4recover.c was called before assigning ntfsd_net_id. The following was observed on a MIPS 32-core processor: kernel: Call Trace: kernel: [<ffffffffc00bc5e4>] rpc_pipefs_event+0x7c/0x158 [nfsd] kernel: [<ffffffff8017a2a0>] notifier_call_chain+0x70/0xb8 kernel: [<ffffffff8017a4e4>] __blocking_notifier_call_chain+0x4c/0x70 kernel: [<ffffffff8053aff8>] rpc_fill_super+0xf8/0x1a0 kernel: [<ffffffff8022204c>] mount_ns+0xb4/0xf0 kernel: [<ffffffff80222b48>] mount_fs+0x50/0x1f8 kernel: [<ffffffff8023dc00>] vfs_kern_mount+0x58/0xf0 kernel: [<ffffffff802404ac>] do_mount+0x27c/0xa28 kernel: [<ffffffff80240cf0>] SyS_mount+0x98/0xe8 kernel: [<ffffffff80135d24>] handle_sys64+0x44/0x68 kernel: kernel: Code: 0040f809 00000000 2e020001 <00020336> 3c12c00d 3c02801a de100000 6442eb98 0040f809 kernel: ---[ end trace 7471374335809536 ]--- Fixed this behaviour by calling register_pernet_subsys(&nfsd_net_ops) before registering rpc_pipefs_event(...) with the notifier chain. Signed-off-by: Giuseppe Cantavenera <giuseppe.cantavenera.ext@nokia.com> Signed-off-by: Lorenzo Restelli <lorenzo.restelli.ext@nokia.com> Reviewed-by: Kinlong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06nfsd: eliminate NFSD_DEBUGMark Salter
commit 135dd002c23054aaa056ea3162c1e0356905c195 upstream. Commit f895b252d4edf ("sunrpc: eliminate RPC_DEBUG") introduced use of IS_ENABLED() in a uapi header which leads to a build failure for userspace apps trying to use <linux/nfsd/debug.h>: linux/nfsd/debug.h:18:15: error: missing binary operator before token "(" #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) ^ Since this was only used to define NFSD_DEBUG if CONFIG_SUNRPC_DEBUG is enabled, replace instances of NFSD_DEBUG with CONFIG_SUNRPC_DEBUG. Fixes: f895b252d4edf "sunrpc: eliminate RPC_DEBUG" Signed-off-by: Mark Salter <msalter@redhat.com> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06nfsd4: disallow SEEK with special stateidsJ. Bruce Fields
commit 980608fb50aea34993ba956b71cd4602aa42b14b upstream. If the client uses a special stateid then we'll pass a NULL file to vfs_llseek. Fixes: 24bab491220f " NFSD: Implement SEEK" Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06nfsd4: fix READ permission checkingJ. Bruce Fields
commit 6e4891dc289cd191d46ab7ba1dcb29646644f9ca upstream. In the case we already have a struct file (derived from a stateid), we still need to do permission-checking; otherwise an unauthorized user could gain access to a file by sniffing or guessing somebody else's stateid. Fixes: dc97618ddda9 "nfsd4: separate splice and readv cases" Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06nfsd4: disallow ALLOCATE with special stateidsJ. Bruce Fields
commit 5ba4a25ab7b13be528b23f85182f4d09cf7f71ad upstream. vfs_fallocate will hit a NULL dereference if the client tries an ALLOCATE or DEALLOCATE with a special stateid. Fix that. (We also depend on the open to have broken any conflicting leases or delegations for us.) (If it turns out we need to allow special stateid's then we could do a temporary open here in the special-stateid case, as we do for read and write. For now I'm assuming it's not necessary.) Fixes: 95d871f03cae "nfsd: Add ALLOCATE support" Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"Nicolas Iooss
commit 3708f842e107b9b79d54a75d152e666b693649e8 upstream. This reverts commit 5a254d08b086d80cbead2ebcee6d2a4b3a15587a. Since commit 5a254d08b086 ("nfs: replace nfs_add_stats with nfs_inc_stats when add one"), nfs_readpage and nfs_do_writepage use nfs_inc_stats to increment NFSIOS_READPAGES and NFSIOS_WRITEPAGES instead of nfs_add_stats. However nfs_inc_stats does not do the same thing as nfs_add_stats with value 1 because these functions work on distinct stats: nfs_inc_stats increments stats from "enum nfs_stat_eventcounters" (in server->io_stats->events) and nfs_add_stats those from "enum nfs_stat_bytecounters" (in server->io_stats->bytes). Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Fixes: 5a254d08b086 ("nfs: replace nfs_add_stats with nfs_inc_stats...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06RCU pathwalk breakage when running into a symlink overmounting somethingAl Viro
commit 3cab989afd8d8d1bc3d99fef0e7ed87c31e7b647 upstream. Calling unlazy_walk() in walk_component() and do_last() when we find a symlink that needs to be followed doesn't acquire a reference to vfsmount. That's fine when the symlink is on the same vfsmount as the parent directory (which is almost always the case), but it's not always true - one _can_ manage to bind a symlink on top of something. And in such cases we end up with excessive mntput(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Update detach_mounts to leave mounts connectedEric W. Biederman
commit e0c9c0afd2fc958ffa34b697972721d81df8a56f upstream. Now that it is possible to lazily unmount an entire mount tree and leave the individual mounts connected to each other add a new flag UMOUNT_CONNECTED to umount_tree to force this behavior and use this flag in detach_mounts. This closes a bug where the deletion of a file or directory could trigger an unmount and reveal data under a mount point. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Fix the error check in __detach_mountsEric W. Biederman
commit f53e57975151f54ad8caa1b0ac8a78091cd5700a upstream. lookup_mountpoint can return either NULL or an error value. Update the test in __detach_mounts to test for an error value to avoid pathological cases causing a NULL pointer dereferences. The callers of __detach_mounts should prevent it from ever being called on an unlinked dentry but don't take any chances. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Honor MNT_LOCKED when detaching mountsEric W. Biederman
commit ce07d891a0891d3c0d0c2d73d577490486b809e1 upstream. Modify umount(MNT_DETACH) to keep mounts in the hash table that are locked to their parent mounts, when the parent is lazily unmounted. In mntput_no_expire detach the children from the hash table, depending on mnt_pin_kill in cleanup_mnt to decrement the mnt_count of the children. In __detach_mounts if there are any mounts that have been unmounted but still are on the list of mounts of a mountpoint, remove their children from the mount hash table and those children to the unmounted list so they won't linger potentially indefinitely waiting for their final mntput, now that the mounts serve no purpose. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Factor umount_mnt from umount_treeEric W. Biederman
commit 6a46c5735c29175da55b2fa9d53775182422cdd7 upstream. For future use factor out a function umount_mnt from umount_tree. This function unhashes a mount and remembers where the mount was mounted so that eventually when the code makes it to a sleeping context the mountpoint can be dput. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Factor out unhash_mnt from detach_mnt and umount_treeEric W. Biederman
commit 7bdb11de8ee4f4ae195e2fa19efd304e0b36c63b upstream. Create a function unhash_mnt that contains the common code between detach_mnt and umount_tree, and use unhash_mnt in place of the common code. This add a unncessary list_del_init(mnt->mnt_child) into umount_tree but given that mnt_child is already empty this extra line is a noop. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Don't propagate unmounts to locked mountsEric W. Biederman
commit 0c56fe31420ca599c90240315f7959bf1b4eb6ce upstream. If the first mount in shared subtree is locked don't unmount the shared subtree. This is ensured by walking through the mounts parents before children and marking a mount as unmountable if it is not locked or it is locked but it's parent is marked. This allows recursive mount detach to propagate through a set of mounts when unmounting them would not reveal what is under any locked mount. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: On an unmount propagate clearing of MNT_LOCKEDEric W. Biederman
commit 5d88457eb5b86b475422dc882f089203faaeedb5 upstream. A prerequisite of calling umount_tree is that the point where the tree is mounted at is valid to unmount. If we are propagating the effect of the unmount clear MNT_LOCKED in every instance where the same filesystem is mounted on the same mountpoint in the mount tree, as we know (by virtue of the fact that umount_tree was called) that it is safe to reveal what is at that mountpoint. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Delay removal from the mount hash.Eric W. Biederman
commit 411a938b5abc9cb126c41cccf5975ae464fe0f3e upstream. - Modify __lookup_mnt_hash_last to ignore mounts that have MNT_UMOUNTED set. - Don't remove mounts from the mount hash table in propogate_umount - Don't remove mounts from the mount hash table in umount_tree before the entire list of mounts to be umounted is selected. - Remove mounts from the mount hash table as the last thing that happens in the case where a mount has a parent in umount_tree. Mounts without parents are not hashed (by definition). This paves the way for delaying removal from the mount hash table even farther and fixing the MNT_LOCKED vs MNT_DETACH issue. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Add MNT_UMOUNT flagEric W. Biederman
commit 590ce4bcbfb4e0462a720a4ad901e84416080bba upstream. In some instances it is necessary to know if the the unmounting process has begun on a mount. Add MNT_UMOUNT to make that reliably testable. This fix gets used in fixing locked mounts in MNT_DETACH Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: In umount_tree reuse mnt_list instead of mnt_hashEric W. Biederman
commit c003b26ff98ca04a180ff34c38c007a3998d62f9 upstream. umount_tree builds a list of mounts that need to be unmounted. Utilize mnt_list for this purpose instead of mnt_hash. This begins to allow keeping a mount on the mnt_hash after it is unmounted, which is necessary for a properly functioning MNT_LOCKED implementation. The fact that mnt_list is an ordinary list makding available list_move is nice bonus. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Don't propagate umounts in __detach_mountsEric W. Biederman
commit 8318e667f176f7ea34451a1a530634e293f216ac upstream. Invoking mount propagation from __detach_mounts is inefficient and wrong. It is inefficient because __detach_mounts already walks the list of mounts that where something needs to be done, and mount propagation walks some subset of those mounts again. It is actively wrong because if the dentry that is passed to __detach_mounts is not part of the path to a mount that mount should not be affected. change_mnt_propagation(p,MS_PRIVATE) modifies the mount propagation tree of a master mount so it's slaves are connected to another master if possible. Which means even removing a mount from the middle of a mount tree with __detach_mounts will not deprive any mount propagated mount events. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06mnt: Improve the umount_tree flagsEric W. Biederman
commit e819f152104c9f7c9fe50e1aecce6f5d4bf06d65 upstream. - Remove the unneeded declaration from pnode.h - Mark umount_tree static as it has no callers outside of namespace.c - Define an enumeration of umount_tree's flags. - Pass umount_tree's flags in by name This removes the magic numbers 0, 1 and 2 making the code a little clearer and makes it possible for there to be lazy unmounts that don't propagate. Which is what __detach_mounts actually wants for example. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06ext4: make fsync to sync parent dir in no-journal for real this timeLukas Czerner
commit e12fb97222fc41e8442896934f76d39ef99b590a upstream. Previously commit 14ece1028b3ed53ffec1b1213ffc6acaf79ad77c added a support for for syncing parent directory of newly created inodes to make sure that the inode is not lost after a power failure in no-journal mode. However this does not work in majority of cases, namely: - if the directory has inline data - if the directory is already indexed - if the directory already has at least one block and: - the new entry fits into it - or we've successfully converted it to indexed So in those cases we might lose the inode entirely even after fsync in the no-journal mode. This also includes ext2 default mode obviously. I've noticed this while running xfstest generic/321 and even though the test should fail (we need to run fsck after a crash in no-journal mode) I could not find a newly created entries even when if it was fsynced before. Fix this by adjusting the ext4_add_entry() successful exit paths to set the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the parent directory as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Frank Mayhar <fmayhar@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06fs/binfmt_elf.c: fix bug in loading of PIE binariesMichael Davidson
commit a87938b2e246b81b4fb713edb371a9fa3c5c3c86 upstream. With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down address allocation strategy, load_elf_binary() will attempt to map a PIE binary into an address range immediately below mm->mmap_base. Unfortunately, load_elf_ binary() does not take account of the need to allocate sufficient space for the entire binary which means that, while the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are that is supposed to be the "gap" between the stack and the binary. Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this means that binaries with large data segments > 128MB can end up mapping part of their data segment over their stack resulting in corruption of the stack (and the data segment once the binary starts to run). Any PIE binary with a data segment > 128MB is vulnerable to this although address randomization means that the actual gap between the stack and the end of the binary is normally greater than 128MB. The larger the data segment of the binary the higher the probability of failure. Fix this by calculating the total size of the binary in the same way as load_elf_interp(). Signed-off-by: Michael Davidson <md@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06NFS: fix BUG() crash in notify_change() with patch to chown_common()Andrew Elble
commit c1b8940b42bb6487b10f2267a96b486276ce9ff7 upstream. We have observed a BUG() crash in fs/attr.c:notify_change(). The crash occurs during an rsync into a filesystem that is exported via NFS. 1.) fs/attr.c:notify_change() modifies the caller's version of attr. 2.) 6de0ec00ba8d ("VFS: make notify_change pass ATTR_KILL_S*ID to setattr operations") introduced a BUG() restriction such that "no function will ever call notify_change() with both ATTR_MODE and ATTR_KILL_S*ID set". Under some circumstances though, it will have assisted in setting the caller's version of attr to this very combination. 3.) 27ac0ffeac80 ("locks: break delegations on any attribute modification") introduced code to handle breaking delegations. This can result in notify_change() being re-called. attr _must_ be explicitly reset to avoid triggering the BUG() established in #2. 4.) The path that that triggers this is via fs/open.c:chmod_common(). The combination of attr flags set here and in the first call to notify_change() along with a later failed break_deleg_wait() results in notify_change() being called again via retry_deleg without resetting attr. Solution is to move retry_deleg in chmod_common() a bit further up to ensure attr is completely reset. There are other places where this seemingly could occur, such as fs/utimes.c:utimes_common(), but the attr flags are not initially set in such a way to trigger this. Fixes: 27ac0ffeac80 ("locks: break delegations on any attribute modification") Reported-by: Eric Meddaugh <etmsys@rit.edu> Tested-by: Eric Meddaugh <etmsys@rit.edu> Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06Btrfs: fix inode eviction infinite loop after extent_same ioctlFilipe Manana
commit 113e8283869b9855c8b999796aadd506bbac155f upstream. If we pass a length of 0 to the extent_same ioctl, we end up locking an extent range with a start offset greater then its end offset (if the destination file's offset is greater than zero). This results in a warning from extent_io.c:insert_state through the following call chain: btrfs_extent_same() btrfs_double_lock() lock_extent_range() lock_extent(inode->io_tree, offset, offset + len - 1) lock_extent_bits() __set_extent_bit() insert_state() --> WARN_ON(end < start) This leads to an infinite loop when evicting the inode. This is the same problem that my previous patch titled "Btrfs: fix inode eviction infinite loop after cloning into it" addressed but for the extent_same ioctl instead of the clone ioctl. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06Btrfs: fix inode eviction infinite loop after cloning into itFilipe Manana
commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329 upstream. If we attempt to clone a 0 length region into a file we can end up inserting a range in the inode's extent_io tree with a start offset that is greater then the end offset, which triggers immediately the following warning: [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3914.620886] BTRFS: end < start 4095 4096 (...) [ 3914.638093] Call Trace: [ 3914.638636] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3914.639620] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3914.640789] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3914.642041] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3914.643236] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3914.644441] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3914.645711] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3914.646914] [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33 [ 3914.648058] [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs] [ 3914.650105] [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs] [ 3914.651361] [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs] [ 3914.652761] [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs] [ 3914.654128] [<ffffffff811226dd>] ? might_fault+0x58/0xb5 [ 3914.655320] [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs] (...) [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]--- This later makes the inode eviction handler enter an infinite loop that keeps dumping the following warning over and over: [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3915.119913] BTRFS: end < start 4095 4096 (...) [ 3915.137394] Call Trace: [ 3915.137913] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3915.139154] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3915.140316] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3915.141505] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3915.142709] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3915.143849] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3915.145120] [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs] [ 3915.146352] [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50 [ 3915.147565] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3915.148785] [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33 [ 3915.149931] [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs] [ 3915.151154] [<ffffffff81168904>] evict+0xa0/0x148 [ 3915.152094] [<ffffffff811689e5>] dispose_list+0x39/0x43 [ 3915.153081] [<ffffffff81169564>] evict_inodes+0xdc/0xeb [ 3915.154062] [<ffffffff81154418>] generic_shutdown_super+0x49/0xef [ 3915.155193] [<ffffffff811546d1>] kill_anon_super+0x13/0x1e [ 3915.156274] [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs] (...) [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]--- So just bail out of the clone ioctl if the length of the region to clone is zero, without locking any extent range, in order to prevent this issue (same behaviour as a pwrite with a 0 length for example). This is trivial to reproduce. For example, the steps for the test I just made for fstests: mkfs.btrfs -f SCRATCH_DEV mount SCRATCH_DEV $SCRATCH_MNT touch $SCRATCH_MNT/foo touch $SCRATCH_MNT/bar $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar umount $SCRATCH_MNT A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06btrfs: don't accept bare namespace as a valid xattrDavid Sterba
commit 3c3b04d10ff1811a27f86684ccd2f5ba6983211d upstream. Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly works: $ touch file $ setfattr -n user. -v 1 file $ getfattr -d file user.="1" ie. the missing attribute name after the namespace. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291 Reported-by: William Douglas <william.douglas@intel.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06Btrfs: fix log tree corruption when fs mounted with -o discardFilipe Manana
commit dcc82f4783ad91d4ab654f89f37ae9291cdc846a upstream. While committing a transaction we free the log roots before we write the new super block. Freeing the log roots implies marking the disk location of every node/leaf (metadata extent) as pinned before the new super block is written. This is to prevent the disk location of log metadata extents from being reused before the new super block is written, otherwise we would have a corrupted log tree if before the new super block is written a crash/reboot happens and the location of any log tree metadata extent ended up being reused and rewritten. Even though we pinned the log tree's metadata extents, we were issuing a discard against them if the fs was mounted with the -o discard option, resulting in corruption of the log tree if a crash/reboot happened before writing the new super block - the next time the fs was mounted, during the log replay process we would find nodes/leafs of the log btree with a content full of zeroes, causing the process to fail and require the use of the tool btrfs-zero-log to wipeout the log tree (and all data previously fsynced becoming lost forever). Fix this by not doing a discard when pinning an extent. The discard will be done later when it's safe (after the new super block is committed) at extent-tree.c:btrfs_finish_extent_commit(). Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29fs: take i_mutex during prepare_binprm for set[ug]id executablesJann Horn
commit 8b01fc86b9f425899f8a3a8fc1c47d73c2c20543 upstream. This prevents a race between chown() and execve(), where chowning a setuid-user binary to root would momentarily make the binary setuid root. This patch was mostly written by Linus Torvalds. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs and fs fixes from Al Viro: "Several AIO and OCFS2 fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ocfs2: _really_ sync the right range ocfs2_file_write_iter: keep return value and current position update in sync [regression] ocfs2: do *not* increment ->ki_pos twice ioctx_alloc(): fix vma (and file) leak on failure fix mremap() vs. ioctx_kill() race
2015-04-09ocfs2: _really_ sync the right rangeAl Viro
"ocfs2 syncs the wrong range" had been broken; prior to it the code was doing the wrong thing in case of O_APPEND, all right, but _after_ it we were syncing the wrong range in 100% cases. *ppos, aka iocb->ki_pos is incremented prior to that point, so we are always doing sync on the area _after_ the one we'd written to. Spotted by Joseph Qi <joseph.qi@huawei.com> back in January; unfortunately, I'd missed his mail back then ;-/ Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-08ocfs2_file_write_iter: keep return value and current position update in syncAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-08[regression] ocfs2: do *not* increment ->ki_pos twiceAl Viro
generic_file_direct_write() already does that. Broken by "ocfs2: do not fallback to buffer I/O write if appending" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-06ioctx_alloc(): fix vma (and file) leak on failureAl Viro
If we fail past the aio_setup_ring(), we need to destroy the mapping. We don't need to care about anybody having found ctx, or added requests to it, since the last failure exit is exactly the failure to make ctx visible to lookups. Reproducer (based on one by Joe Mario <jmario@redhat.com>): void count(char *p) { char s[80]; printf("%s: ", p); fflush(stdout); sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid()); system(s); } int main() { io_context_t *ctx; int created, limit, i, destroyed; FILE *f; count("before"); if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL) perror("opening aio-max-nr"); else if (fscanf(f, "%d", &limit) != 1) fprintf(stderr, "can't parse aio-max-nr\n"); else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL) perror("allocating aio_context_t array"); else { for (i = 0, created = 0; i < limit; i++) { if (io_setup(1000, ctx + created) == 0) created++; } for (i = 0, destroyed = 0; i < created; i++) if (io_destroy(ctx[i]) == 0) destroyed++; printf("created %d, failed %d, destroyed %d\n", created, limit - created, destroyed); count("after"); } } Found-by: Joe Mario <jmario@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-06fix mremap() vs. ioctx_kill() raceAl Viro
teach ->mremap() method to return an error and have it fail for aio mappings in process of being killed Note that in case of ->mremap() failure we need to undo move_page_tables() we'd already done; we could call ->mremap() first, but then the failure of move_page_tables() would require undoing whatever _successful_ ->mremap() has done, which would be a lot more headache in general. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-03Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French: "A set of small cifs fixes fixing a memory leak, kernel oops, and infinite loop (and some spotted by Coverity)" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: Fix warning Fix another dereference before null check warning CIFS: session servername can't be null Fix warning on impossible comparison Fix coverity warning Fix dereference before null check warning Don't ignore errors on encrypting password in SMBTcon Fix warning on uninitialized buftype cifs: potential memory leaks when parsing mnt opts cifs: fix use-after-free bug in find_writable_file cifs: smb2_clone_range() - exit on unhandled error
2015-04-01Merge tag 'lazytime_fix' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull lazytime fixes from Ted Ts'o: "This fixes a problem in the lazy time patches, which can cause frequently updated inods to never have their timestamps updated. These changes guarantee that no timestamp on disk will be stale by more than 24 hours" * tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: fs: add dirtytime_expire_seconds sysctl fs: make sure the timestamps for lazytime inodes eventually get written
2015-04-01Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields: "Two main issues: - We found that turning on pNFS by default (when it's configured at build time) was too aggressive, so we want to switch the default before the 4.0 release. - Recent client changes to increase open parallelism uncovered a serious bug lurking in the server's open code. Also fix a krb5/selinux regression. The rest is mainly smaller pNFS fixes" * 'for-4.0' of git://linux-nfs.org/~bfields/linux: sunrpc: make debugfs file creation failure non-fatal nfsd: require an explicit option to enable pNFS NFSD: Fix bad update of layout in nfsd4_return_file_layout NFSD: Take care the return value from nfsd4_encode_stateid NFSD: Printk blocklayout length and offset as format 0x%llx nfsd: return correct lockowner when there is a race on hash insert nfsd: return correct openowner when there is a race to put one in the hash NFSD: Put exports after nfsd4_layout_verify fail NFSD: Error out when register_shrinker() fail NFSD: Take care the return value from nfsd4_decode_stateid NFSD: Check layout type when returning client layouts NFSD: restore trace event lost in mismerge
2015-04-01Fix warningSteve French
Coverity reports a warning due to unitialized attr structure in one code path. Reported by Coverity (CID 728535) Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Jeff Layton <jlayton@samba.org>