aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-05-13ext4: fix data corruption caused by unwritten and delayed extentsLukas Czerner
commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream. Currently it is possible to lose whole file system block worth of data when we hit the specific interaction with unwritten and delayed extents in status extent tree. The problem is that when we insert delayed extent into extent status tree the only way to get rid of it is when we write out delayed buffer. However there is a limitation in the extent status tree implementation so that when inserting unwritten extent should there be even a single delayed block the whole unwritten extent would be marked as delayed. At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when a we write into said unwritten extent we will convert it to written, but it still remains delayed. When we try to write into that block later ext4_da_map_blocks() will set the buffer new and delayed and map it to invalid block which causes the rest of the block to be zeroed loosing already written data. For now we can fix this by simply not allowing to set delayed status on written extent in the extent status tree. Also add WARN_ON() to make sure that we notice if this happens in the future. This problem can be easily reproduced by running the following xfs_io. xfs_io -f -c "pwrite -S 0xaa 4096 2048" \ -c "falloc 0 131072" \ -c "pwrite -S 0xbb 65536 2048" \ -c "fsync" /mnt/test/fff echo 3 > /proc/sys/vm/drop_caches xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff This can be theoretically also reproduced by at random by running fsx, but it's not very reliable, though on machines with bigger page size (like ppc) this can be seen more often (especially xfstest generic/127) Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06RCU pathwalk breakage when running into a symlink overmounting somethingAl Viro
commit 3cab989afd8d8d1bc3d99fef0e7ed87c31e7b647 upstream. Calling unlazy_walk() in walk_component() and do_last() when we find a symlink that needs to be followed doesn't acquire a reference to vfsmount. That's fine when the symlink is on the same vfsmount as the parent directory (which is almost always the case), but it's not always true - one _can_ manage to bind a symlink on top of something. And in such cases we end up with excessive mntput(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06ext4: make fsync to sync parent dir in no-journal for real this timeLukas Czerner
commit e12fb97222fc41e8442896934f76d39ef99b590a upstream. Previously commit 14ece1028b3ed53ffec1b1213ffc6acaf79ad77c added a support for for syncing parent directory of newly created inodes to make sure that the inode is not lost after a power failure in no-journal mode. However this does not work in majority of cases, namely: - if the directory has inline data - if the directory is already indexed - if the directory already has at least one block and: - the new entry fits into it - or we've successfully converted it to indexed So in those cases we might lose the inode entirely even after fsync in the no-journal mode. This also includes ext2 default mode obviously. I've noticed this while running xfstest generic/321 and even though the test should fail (we need to run fsck after a crash in no-journal mode) I could not find a newly created entries even when if it was fsynced before. Fix this by adjusting the ext4_add_entry() successful exit paths to set the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the parent directory as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Frank Mayhar <fmayhar@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06fs/binfmt_elf.c: fix bug in loading of PIE binariesMichael Davidson
commit a87938b2e246b81b4fb713edb371a9fa3c5c3c86 upstream. With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down address allocation strategy, load_elf_binary() will attempt to map a PIE binary into an address range immediately below mm->mmap_base. Unfortunately, load_elf_ binary() does not take account of the need to allocate sufficient space for the entire binary which means that, while the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are that is supposed to be the "gap" between the stack and the binary. Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this means that binaries with large data segments > 128MB can end up mapping part of their data segment over their stack resulting in corruption of the stack (and the data segment once the binary starts to run). Any PIE binary with a data segment > 128MB is vulnerable to this although address randomization means that the actual gap between the stack and the end of the binary is normally greater than 128MB. The larger the data segment of the binary the higher the probability of failure. Fix this by calculating the total size of the binary in the same way as load_elf_interp(). Signed-off-by: Michael Davidson <md@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06Btrfs: fix inode eviction infinite loop after cloning into itFilipe Manana
commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329 upstream. If we attempt to clone a 0 length region into a file we can end up inserting a range in the inode's extent_io tree with a start offset that is greater then the end offset, which triggers immediately the following warning: [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3914.620886] BTRFS: end < start 4095 4096 (...) [ 3914.638093] Call Trace: [ 3914.638636] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3914.639620] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3914.640789] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3914.642041] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3914.643236] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3914.644441] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3914.645711] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3914.646914] [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33 [ 3914.648058] [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs] [ 3914.650105] [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs] [ 3914.651361] [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs] [ 3914.652761] [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs] [ 3914.654128] [<ffffffff811226dd>] ? might_fault+0x58/0xb5 [ 3914.655320] [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs] (...) [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]--- This later makes the inode eviction handler enter an infinite loop that keeps dumping the following warning over and over: [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3915.119913] BTRFS: end < start 4095 4096 (...) [ 3915.137394] Call Trace: [ 3915.137913] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3915.139154] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3915.140316] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3915.141505] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3915.142709] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3915.143849] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3915.145120] [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs] [ 3915.146352] [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50 [ 3915.147565] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3915.148785] [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33 [ 3915.149931] [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs] [ 3915.151154] [<ffffffff81168904>] evict+0xa0/0x148 [ 3915.152094] [<ffffffff811689e5>] dispose_list+0x39/0x43 [ 3915.153081] [<ffffffff81169564>] evict_inodes+0xdc/0xeb [ 3915.154062] [<ffffffff81154418>] generic_shutdown_super+0x49/0xef [ 3915.155193] [<ffffffff811546d1>] kill_anon_super+0x13/0x1e [ 3915.156274] [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs] (...) [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]--- So just bail out of the clone ioctl if the length of the region to clone is zero, without locking any extent range, in order to prevent this issue (same behaviour as a pwrite with a 0 length for example). This is trivial to reproduce. For example, the steps for the test I just made for fstests: mkfs.btrfs -f SCRATCH_DEV mount SCRATCH_DEV $SCRATCH_MNT touch $SCRATCH_MNT/foo touch $SCRATCH_MNT/bar $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar umount $SCRATCH_MNT A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06Btrfs: fix log tree corruption when fs mounted with -o discardFilipe Manana
commit dcc82f4783ad91d4ab654f89f37ae9291cdc846a upstream. While committing a transaction we free the log roots before we write the new super block. Freeing the log roots implies marking the disk location of every node/leaf (metadata extent) as pinned before the new super block is written. This is to prevent the disk location of log metadata extents from being reused before the new super block is written, otherwise we would have a corrupted log tree if before the new super block is written a crash/reboot happens and the location of any log tree metadata extent ended up being reused and rewritten. Even though we pinned the log tree's metadata extents, we were issuing a discard against them if the fs was mounted with the -o discard option, resulting in corruption of the log tree if a crash/reboot happened before writing the new super block - the next time the fs was mounted, during the log replay process we would find nodes/leafs of the log btree with a content full of zeroes, causing the process to fail and require the use of the tool btrfs-zero-log to wipeout the log tree (and all data previously fsynced becoming lost forever). Fix this by not doing a discard when pinning an extent. The discard will be done later when it's safe (after the new super block is committed) at extent-tree.c:btrfs_finish_extent_commit(). Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29dcache: Fix locking bugs in backported "deal with deadlock in d_walk()"Ben Hutchings
commit 20defcec264ceab2630356fb9d397f3d237b5e6d upstream in 3.2-stable Steven Rostedt reported: > Porting -rt to the latest 3.2 stable tree I triggered this bug: > > ===================================== > [ BUG: bad unlock balance detected! ] > ------------------------------------- > rm/1638 is trying to release lock (rcu_read_lock) at: > [<c04fde6c>] rcu_read_unlock+0x0/0x23 > but there are no more locks to release! > > other info that might help us debug this: > 2 locks held by rm/1638: > #0: (&sb->s_type->i_mutex_key#9/1){+.+.+.}, at: [<c04f93eb>] do_rmdir+0x5f/0xd2 > #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<c04f9329>] vfs_rmdir+0x49/0xac > > stack backtrace: > Pid: 1638, comm: rm Not tainted 3.2.66-test-rt96+ #2 > Call Trace: > [<c083f390>] ? printk+0x1d/0x1f > [<c0463cdf>] print_unlock_inbalance_bug+0xc3/0xcd > [<c04653a8>] lock_release_non_nested+0x98/0x1ec > [<c046228d>] ? trace_hardirqs_off_caller+0x18/0x90 > [<c0456f1c>] ? local_clock+0x2d/0x50 > [<c04fde6c>] ? d_hash+0x2f/0x2f > [<c04fde6c>] ? d_hash+0x2f/0x2f > [<c046568e>] lock_release+0x192/0x1ad > [<c04fde83>] rcu_read_unlock+0x17/0x23 > [<c04ff344>] shrink_dcache_parent+0x227/0x270 > [<c04f9348>] vfs_rmdir+0x68/0xac > [<c04f9424>] do_rmdir+0x98/0xd2 > [<c04f03ad>] ? fput+0x1a3/0x1ab > [<c084dd42>] ? sysenter_exit+0xf/0x1a > [<c0465b58>] ? trace_hardirqs_on_caller+0x118/0x149 > [<c04fa3e0>] sys_unlinkat+0x2b/0x35 > [<c084dd13>] sysenter_do_call+0x12/0x12 > > > > > There's a path to calling rcu_read_unlock() without calling > rcu_read_lock() in have_submounts(). > > goto positive; > > positive: > if (!locked && read_seqretry(&rename_lock, seq)) > goto rename_retry; > > rename_retry: > rcu_read_unlock(); > > in the above path, rcu_read_lock() is never done before calling > rcu_read_unlock(); I reviewed locking contexts in all three functions that I changed when backporting "deal with deadlock in d_walk()". It's actually worse than this: - We don't hold this_parent->d_lock at the 'positive' label in have_submounts(), but it is unlocked after 'rename_retry'. - There is an rcu_read_unlock() after the 'out' label in select_parent(), but it's not held at the 'goto out'. Fix all three lock imbalances. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29deal with deadlock in d_walk()Al Viro
commit ca5358ef75fc69fee5322a38a340f5739d997c10 upstream. ... by not hitting rename_retry for reasons other than rename having happened. In other words, do _not_ restart when finding that between unlocking the child and locking the parent the former got into __dentry_kill(). Skip the killed siblings instead... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Hutchings <ben@decadent.org.uk> [hujianyang: Backported to 3.10 refer to the work of Ben Hutchings in 3.2: - As we only have try_to_ascend() and not d_walk(), apply this change to all callers of try_to_ascend() - Adjust context to make __dentry_kill() apply to d_kill()] Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29move d_rcu from overlapping d_child to overlapping d_aliasAl Viro
commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Hutchings <ben@decadent.org.uk> [hujianyang: Backported to 3.10 refer to the work of Ben Hutchings in 3.2: - Apply name changes in all the different places we use d_alias and d_child - Move the WARN_ON() in __d_free() to d_free() as we don't have dentry_free()] Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29splice: Apply generic position and size checks to each writeBen Hutchings
commit 894c6350eaad7e613ae267504014a456e00a3e2a from the 3.2-stable branch. We need to check the position and size of file writes against various limits, using generic_write_check(). This was not being done for the splice write path. It was fixed upstream by commit 8d0207652cbe ("->splice_write() via ->write_iter()") but we can't apply that. CVE-2014-7822 Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [Ben fixed it in 3.2 stable, i ported it to 3.10 stable] Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29jfs: fix readdir regressionDave Kleikamp
Upstream commit 44512449, "jfs: fix readdir cookie incompatibility with NFSv4", was backported incorrectly into the stable trees which used the filldir callback (rather than dir_emit). The position is being incorrectly passed to filldir for the . and .. entries. The still-maintained stable trees that need to be fixed are 3.2.y, 3.4.y and 3.10.y. https://bugzilla.kernel.org/show_bug.cgi?id=94741 Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: jfs-discussion@lists.sourceforge.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29remove extra definitions of U32_MAXAlex Elder
commit 04f9b74e4d96d349de12fdd4e6626af4a9f75e09 upstream. Now that the definition is centralized in <linux/kernel.h>, the definitions of U32_MAX (and related) elsewhere in the kernel can be removed. Signed-off-by: Alex Elder <elder@linaro.org> Acked-by: Sage Weil <sage@inktank.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29conditionally define U32_MAXAlex Elder
commit 77719536dc00f8fd8f5abe6dadbde5331c37f996 upstream. The symbol U32_MAX is defined in several spots. Change these definitions to be conditional. This is in preparation for the next patch, which centralizes the definition in <linux/kernel.h>. Signed-off-by: Alex Elder <elder@linaro.org> Cc: Sage Weil <sage@inktank.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-19pagemap: do not leak physical addresses to non-privileged userspaceKirill A. Shutemov
commit ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce upstream. As pointed by recent post[1] on exploiting DRAM physical imperfection, /proc/PID/pagemap exposes sensitive information which can be used to do attacks. This disallows anybody without CAP_SYS_ADMIN to read the pagemap. [1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html [ Eventually we might want to do anything more finegrained, but for now this is the simple model. - Linus ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Seaborn <mseaborn@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: mancha security <mancha1@zoho.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-19ocfs2: _really_ sync the right rangeAl Viro
commit 64b4e2526d1cf6e6a4db6213d6e2b6e6ab59479a upstream. "ocfs2 syncs the wrong range" had been broken; prior to it the code was doing the wrong thing in case of O_APPEND, all right, but _after_ it we were syncing the wrong range in 100% cases. *ppos, aka iocb->ki_pos is incremented prior to that point, so we are always doing sync on the area _after_ the one we'd written to. Spotted by Joseph Qi <joseph.qi@huawei.com> back in January; unfortunately, I'd missed his mail back then ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-19cifs: fix use-after-free bug in find_writable_fileDavid Disseldorp
commit e1e9bda22d7ddf88515e8fe401887e313922823e upstream. Under intermittent network outages, find_writable_file() is susceptible to the following race condition, which results in a user-after-free in the cifs_writepages code-path: Thread 1 Thread 2 ======== ======== inv_file = NULL refind = 0 spin_lock(&cifs_file_list_lock) // invalidHandle found on openFileList inv_file = open_file // inv_file->count currently 1 cifsFileInfo_get(inv_file) // inv_file->count = 2 spin_unlock(&cifs_file_list_lock); cifs_reopen_file() cifs_close() // fails (rc != 0) ->cifsFileInfo_put() spin_lock(&cifs_file_list_lock) // inv_file->count = 1 spin_unlock(&cifs_file_list_lock) spin_lock(&cifs_file_list_lock); list_move_tail(&inv_file->flist, &cifs_inode->openFileList); spin_unlock(&cifs_file_list_lock); cifsFileInfo_put(inv_file); ->spin_lock(&cifs_file_list_lock) // inv_file->count = 0 list_del(&cifs_file->flist); // cleanup!! kfree(cifs_file); spin_unlock(&cifs_file_list_lock); spin_lock(&cifs_file_list_lock); ++refind; // refind = 1 goto refind_writable; At this point we loop back through with an invalid inv_file pointer and a refind value of 1. On second pass, inv_file is not overwritten on openFileList traversal, and is subsequently dereferenced. Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-13hfsplus: fix B-tree corruption after insertion at position 0Sergei Antonov
commit 98cf21c61a7f5419d82f847c4d77bf6e96a76f5f upstream. Fix B-tree corruption when a new record is inserted at position 0 in the node in hfs_brec_insert(). In this case a hfs_brec_update_parent() is called to update the parent index node (if exists) and it is passed hfs_find_data with a search_key containing a newly inserted key instead of the key to be updated. This results in an inconsistent index node. The bug reproduces on my machine after an extents overflow record for the catalog file (CNID=4) is inserted into the extents overflow B-tree. Because of a low (reserved) value of CNID=4, it has to become the first record in the first leaf node. The resulting first leaf node is correct: ---------------------------------------------------- | key0.CNID=4 | key1.CNID=123 | key2.CNID=456, ... | ---------------------------------------------------- But the parent index key0 still contains the previous key CNID=123: ----------------------- | key0.CNID=123 | ... | ----------------------- A change in hfs_brec_insert() makes hfs_brec_update_parent() work correctly by preventing it from getting fd->record=-1 value from __hfs_brec_find(). Along the way, I removed duplicate code with unification of the if condition. The resulting code is equivalent to the original code because node is never 0. Also hfs_brec_update_parent() will now return an error after getting a negative fd->record value. However, the return value of hfs_brec_update_parent() is not checked anywhere in the file and I'm leaving it unchanged by this patch. brec.c lacks error checking after some other calls too, but this issue is of less importance than the one being fixed by this patch. Signed-off-by: Sergei Antonov <saproj@gmail.com> Cc: Joe Perches <joe@perches.com> Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com> Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-26nilfs2: fix deadlock of segment constructor during recoveryRyusuke Konishi
commit 283ee1482f349d6c0c09dfb725db5880afc56813 upstream. According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck during recovery at mount time. The code path that caused the deadlock was as follows: nilfs_fill_super() load_nilfs() nilfs_salvage_orphan_logs() * Do roll-forwarding, attach segment constructor for recovery, and kick it. nilfs_segctor_thread() nilfs_segctor_thread_construct() * A lock is held with nilfs_transaction_lock() nilfs_segctor_do_construct() nilfs_segctor_drop_written_files() iput() iput_final() write_inode_now() writeback_single_inode() __writeback_single_inode() do_writepages() nilfs_writepage() nilfs_construct_dsync_segment() nilfs_transaction_lock() --> deadlock This can happen if commit 7ef3ff2fea8b ("nilfs2: fix deadlock of segment constructor over I_SYNC flag") is applied and roll-forward recovery was performed at mount time. The roll-forward recovery can happen if datasync write is done and the file system crashes immediately after that. For instance, we can reproduce the issue with the following steps: < nilfs2 is mounted on /nilfs (device: /dev/sdb1) > # dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync # dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k count=1 && reboot -nfh < the system will immediately reboot > # mount -t nilfs2 /dev/sdb1 /nilfs The deadlock occurs because iput() can run segment constructor through writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags. The above commit changed segment constructor so that it calls iput() asynchronously for inodes with i_nlink == 0, but that change was imperfect. This fixes the another deadlock by deferring iput() in segment constructor even for the case that mount is not finished, that is, for the case that MS_ACTIVE flag is not set. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reported-by: Yuxuan Shui <yshuiv7@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-26fuse: notify: don't move pagesMiklos Szeredi
commit 0d2783626a53d4c922f82d51fa675cb5d13f0d36 upstream. fuse_try_move_page() is not prepared for replacing pages that have already been read. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-26fuse: set stolen page uptodateMiklos Szeredi
commit aa991b3b267e24f578bac7b09cc57579b660304b upstream. Regular pipe buffers' ->steal method (generic_pipe_buf_steal()) doesn't set PG_uptodate. Don't warn on this condition, just set the uptodate flag. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18nilfs2: fix potential memory overrun on inodeRyusuke Konishi
commit 957ed60b53b519064a54988c4e31e0087e47d091 upstream. Each inode of nilfs2 stores a root node of a b-tree, and it turned out to have a memory overrun issue: Each b-tree node of nilfs2 stores a set of key-value pairs and the number of them (in "bn_nchildren" member of nilfs_btree_node struct), as well as a few other "bn_*" members. Since the value of "bn_nchildren" is used for operations on the key-values within the b-tree node, it can cause memory access overrun if a large number is incorrectly set to "bn_nchildren". For instance, nilfs_btree_node_lookup() function determines the range of binary search with it, and too large "bn_nchildren" leads nilfs_btree_node_get_key() in that function to overrun. As for intermediate b-tree nodes, this is prevented by a sanity check performed when each node is read from a drive, however, no sanity check has been done for root nodes stored in inodes. This patch fixes the issue by adding missing sanity check against b-tree root nodes so that it's called when on-memory inodes are read from ifile, inode metadata file. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18procfs: fix race between symlink removals and traversalsAl Viro
commit 7e0e953bb0cf649f93277ac8fb67ecbb7f7b04a9 upstream. use_pde()/unuse_pde() in ->follow_link()/->put_link() resp. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18debugfs: leave freeing a symlink body until inode evictionAl Viro
commit 0db59e59299f0b67450c5db21f7f316c8fb04e84 upstream. As it is, we have debugfs_remove() racing with symlink traversals. Supply ->evict_inode() and do freeing there - inode will remain pinned until we are done with the symlink body. And rip the idiocy with checking if dentry is positive right after we'd verified debugfs_positive(), which is a stronger check... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocationAl Viro
commit 0a280962dc6e117e0e4baa668453f753579265d9 upstream. X-Coverup: just ask spender Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.Quentin Casasnovas
commit dd9ef135e3542ffc621c4eb7f0091870ec7a1504 upstream. Improper arithmetics when calculting the address of the extended ref could lead to an out of bounds memory read and kernel panic. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18Btrfs: fix data loss in the fast fsync pathFilipe Manana
commit 3a8b36f378060d20062a0918e99fae39ff077bf0 upstream. When using the fast file fsync code path we can miss the fact that new writes happened since the last file fsync and therefore return without waiting for the IO to finish and write the new extents to the fsync log. Here's an example scenario where the fsync will miss the fact that new file data exists that wasn't yet durably persisted: 1. fs_info->last_trans_committed == N - 1 and current transaction is transaction N (fs_info->generation == N); 2. do a buffered write; 3. fsync our inode, this clears our inode's full sync flag, starts an ordered extent and waits for it to complete - when it completes at btrfs_finish_ordered_io(), the inode's last_trans is set to the value N (via btrfs_update_inode_fallback -> btrfs_update_inode -> btrfs_set_inode_last_trans); 4. transaction N is committed, so fs_info->last_trans_committed is now set to the value N and fs_info->generation remains with the value N; 5. do another buffered write, when this happens btrfs_file_write_iter sets our inode's last_trans to the value N + 1 (that is fs_info->generation + 1 == N + 1); 6. transaction N + 1 is started and fs_info->generation now has the value N + 1; 7. transaction N + 1 is committed, so fs_info->last_trans_committed is set to the value N + 1; 8. fsync our inode - because it doesn't have the full sync flag set, we only start the ordered extent, we don't wait for it to complete (only in a later phase) therefore its last_trans field has the value N + 1 set previously by btrfs_file_write_iter(), and so we have: inode->last_trans <= fs_info->last_trans_committed (N + 1) (N + 1) Which made us not log the last buffered write and exit the fsync handler immediately, returning success (0) to user space and resulting in data loss after a crash. This can actually be triggered deterministically and the following excerpt from a testcase I made for xfstests triggers the issue. It moves a dummy file across directories and then fsyncs the old parent directory - this is just to trigger a transaction commit, so moving files around isn't directly related to the issue but it was chosen because running 'sync' for example does more than just committing the current transaction, as it flushes/waits for all file data to be persisted. The issue can also happen at random periods, since the transaction kthread periodicaly commits the current transaction (about every 30 seconds by default). The body of the test is: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our main test file 'foo', the one we check for data loss. # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync' # bit from its flags (btrfs inode specific flags). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \ -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io # Now create one other file and 2 directories. We will move this second file # from one directory to the other later because it forces btrfs to commit its # currently open transaction if we fsync the old parent directory. This is # necessary to trigger the data loss bug that affected btrfs. mkdir $SCRATCH_MNT/testdir_1 touch $SCRATCH_MNT/testdir_1/bar mkdir $SCRATCH_MNT/testdir_2 # Make sure everything is durably persisted. sync # Write more 8Kb of data to our file. $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io # Move our 'bar' file into a new directory. mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar # Fsync our first directory. Because it had a file moved into some other # directory, this made btrfs commit the currently open transaction. This is # a condition necessary to trigger the data loss bug. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1 # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of # data we wrote previously to be persisted and available if a crash happens. # This did not happen with btrfs, because of the transaction commit that # happened when we fsynced the parent directory. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now check that all data we wrote before are available. echo "File content after log replay:" od -t x1 $SCRATCH_MNT/foo status=0 exit The expected golden output for the test, which is what we get with this fix applied (or when running against ext3/4 and xfs), is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 8192/8192 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb * 0040000 Without this fix applied, the output shows the test file does not have the second 8Kb extent that we successfully fsynced: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 8192/8192 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 So fix this by skipping the fsync only if we're doing a full sync and if the inode's last_trans is <= fs_info->last_trans_committed, or if the inode is already in the log. Also remove setting the inode's last_trans in btrfs_file_write_iter since it's useless/unreliable. Also because btrfs_file_write_iter no longer sets inode->last_trans to fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't bail out if last_trans is 0, otherwise something as simple as the following example wouldn't log the second write on the last fsync: 1. write to file 2. fsync file 3. fsync file |--> btrfs_inode_in_log() returns true and it set last_trans to 0 4. write to file |--> btrfs_file_write_iter() no longers sets last_trans, so it remained with a value of 0 5. fsync |--> inode->last_trans == 0, so it bails out without logging the second write A test case for xfstests will be sent soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18btrfs: fix lost return value due to variable shadowingDavid Sterba
commit 1932b7be973b554ffe20a5bba6ffaed6fa995cdc upstream. A block-local variable stores error code but btrfs_get_blocks_direct may not return it in the end as there's a ret defined in the function scope. Fixes: d187663ef24c ("Btrfs: lock extents as we map them in DIO") Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06x86, mm/ASLR: Fix stack randomization on 64-bit systemsHector Marco-Gisbert
commit 4e7c22d447bb6d7e37bfe39ff658486ae78e8d77 upstream. The issue is that the stack for processes is not properly randomized on 64 bit architectures due to an integer overflow. The affected function is randomize_stack_top() in file "fs/binfmt_elf.c": static unsigned long randomize_stack_top(unsigned long stack_top) { unsigned int random_variable = 0; if ((current->flags & PF_RANDOMIZE) && !(current->personality & ADDR_NO_RANDOMIZE)) { random_variable = get_random_int() & STACK_RND_MASK; random_variable <<= PAGE_SHIFT; } return PAGE_ALIGN(stack_top) + random_variable; return PAGE_ALIGN(stack_top) - random_variable; } Note that, it declares the "random_variable" variable as "unsigned int". Since the result of the shifting operation between STACK_RND_MASK (which is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64): random_variable <<= PAGE_SHIFT; then the two leftmost bits are dropped when storing the result in the "random_variable". This variable shall be at least 34 bits long to hold the (22+12) result. These two dropped bits have an impact on the entropy of process stack. Concretely, the total stack entropy is reduced by four: from 2^28 to 2^30 (One fourth of expected entropy). This patch restores back the entropy by correcting the types involved in the operations in the functions randomize_stack_top() and stack_maxrandom_size(). The successful fix can be tested with: $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done 7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0 [stack] 7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0 [stack] 7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0 [stack] 7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0 [stack] ... Once corrected, the leading bytes should be between 7ffc and 7fff, rather than always being 7fff. Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es> Signed-off-by: Ismael Ripoll <iripoll@upv.es> [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ] Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Fixes: CVE-2015-1593 Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06jffs2: fix handling of corrupted summary lengthChen Jie
commit 164c24063a3eadee11b46575c5482b2f1417be49 upstream. sm->offset maybe wrong but magic maybe right, the offset do not have CRC. Badness at c00c7580 [verbose debug info unavailable] NIP: c00c7580 LR: c00c718c CTR: 00000014 REGS: df07bb40 TRAP: 0700 Not tainted (2.6.34.13-WR4.3.0.0_standard) MSR: 00029000 <EE,ME,CE> CR: 22084f84 XER: 00000000 TASK = df84d6e0[908] 'mount' THREAD: df07a000 GPR00: 00000001 df07bbf0 df84d6e0 00000000 00000001 00000000 df07bb58 00000041 GPR08: 00000041 c0638860 00000000 00000010 22084f88 100636c8 df814ff8 00000000 GPR16: df84d6e0 dfa558cc c05adb90 00000048 c0452d30 00000000 000240d0 000040d0 GPR24: 00000014 c05ae734 c05be2e0 00000000 00000001 00000000 00000000 c05ae730 NIP [c00c7580] __alloc_pages_nodemask+0x4d0/0x638 LR [c00c718c] __alloc_pages_nodemask+0xdc/0x638 Call Trace: [df07bbf0] [c00c718c] __alloc_pages_nodemask+0xdc/0x638 (unreliable) [df07bc90] [c00c7708] __get_free_pages+0x20/0x48 [df07bca0] [c00f4a40] __kmalloc+0x15c/0x1ec [df07bcd0] [c01fc880] jffs2_scan_medium+0xa58/0x14d0 [df07bd70] [c01ff38c] jffs2_do_mount_fs+0x1f4/0x6b4 [df07bdb0] [c020144c] jffs2_do_fill_super+0xa8/0x260 [df07bdd0] [c020230c] jffs2_fill_super+0x104/0x184 [df07be00] [c0335814] get_sb_mtd_aux+0x9c/0xec [df07be20] [c033596c] get_sb_mtd+0x84/0x1e8 [df07be60] [c0201ed0] jffs2_get_sb+0x1c/0x2c [df07be70] [c0103898] vfs_kern_mount+0x78/0x1e8 [df07bea0] [c0103a58] do_kern_mount+0x40/0x100 [df07bec0] [c011fe90] do_mount+0x240/0x890 [df07bf10] [c0120570] sys_mount+0x90/0xd8 [df07bf40] [c00110d8] ret_from_syscall+0x0/0x4 === Exception: c01 at 0xff61a34 LR = 0x100135f0 Instruction dump: 38800005 38600000 48010f41 4bfffe1c 4bfc2d15 4bfffe8c 72e90200 4082fc28 3d20c064 39298860 8809000d 68000001 <0f000000> 2f800000 419efc0c 38000001 mount: mounting /dev/mtdblock3 on /common failed: Input/output error Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06NFSv4.1: Fix a kfree() of uninitialised pointers in decode_cb_sequence_argsTrond Myklebust
commit d8ba1f971497c19cf80da1ea5391a46a5f9fbd41 upstream. If the call to decode_rc_list() fails due to a memory allocation error, then we need to truncate the array size to ensure that we only call kfree() on those pointer that were allocated. Reported-by: David Ramos <daramos@stanford.edu> Fixes: 4aece6a19cf7f ("nfs41: cb_sequence xdr implementation") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06nfs: don't call blocking operations while !TASK_RUNNINGJeff Layton
commit 6ffa30d3f734d4f6b478081dfc09592021028f90 upstream. Bruce reported seeing this warning pop when mounting using v4.1: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810ff58f>] prepare_to_wait+0x2f/0x90 Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014 0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78 0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000 Call Trace: [<ffffffff8186ac78>] dump_stack+0x4c/0x65 [<ffffffff810ac9da>] warn_slowpath_common+0x8a/0xc0 [<ffffffff810aca65>] warn_slowpath_fmt+0x55/0x70 [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90 [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90 [<ffffffff810dd2ad>] __might_sleep+0xbd/0xd0 [<ffffffff8124c973>] kmem_cache_alloc_trace+0x243/0x430 [<ffffffff810d941e>] ? groups_alloc+0x3e/0x130 [<ffffffff810d941e>] groups_alloc+0x3e/0x130 [<ffffffffa0301b1e>] svcauth_unix_accept+0x16e/0x290 [sunrpc] [<ffffffffa0300571>] svc_authenticate+0xe1/0xf0 [sunrpc] [<ffffffffa02fc564>] svc_process_common+0x244/0x6a0 [sunrpc] [<ffffffffa02fd044>] bc_svc_process+0x1c4/0x260 [sunrpc] [<ffffffffa03d5478>] nfs41_callback_svc+0x128/0x1f0 [nfsv4] [<ffffffff810ff970>] ? wait_woken+0xc0/0xc0 [<ffffffffa03d5350>] ? nfs4_callback_svc+0x60/0x60 [nfsv4] [<ffffffff810d45bf>] kthread+0x11f/0x140 [<ffffffff810ea815>] ? local_clock+0x15/0x30 [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250 [<ffffffff81874bfc>] ret_from_fork+0x7c/0xb0 [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250 ---[ end trace 675220a11e30f4f2 ]--- nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE, which is just wrong. Fix that by finishing the wait immediately if we've found that the list has something on it. Also, we don't expect this kthread to accept signals, so we should be using a TASK_UNINTERRUPTIBLE sleep instead. That however, opens us up hung task warnings from the watchdog, so have the schedule_timeout wake up every 60s if there's no callback activity. Reported-by: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06xfs: set superblock buffer type correctlyDave Chinner
commit 3443a3bca54588f43286b725d8648d33a38c86f1 upstream. When the superblock is modified in a transaction, the commonly modified fields are not actually copied to the superblock buffer to avoid the buffer lock becoming a serialisation point. However, there are some other operations that modify the superblock fields within the transaction that don't directly log to the superblock but rely on the changes to be applied during the transaction commit (to minimise the buffer lock hold time). When we do this, we fail to mark the buffer log item as being a superblock buffer and that can lead to the buffer not being marked with the corect type in the log and hence causing recovery issues. Fix it by setting the type correctly, similar to xfs_mod_sb()... Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06xfs: inode unlink does not set AGI buffer typeDave Chinner
commit f19b872b086711bb4b22c3a0f52f16aa920bcc61 upstream. This leads to log recovery throwing errors like: XFS (md0): Mounting V5 Filesystem XFS (md0): Starting recovery (logdev: internal) XFS (md0): Unknown buffer type 0! XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1 ffff8800ffc53800: 58 41 47 49 ..... Which is the AGI buffer magic number. Ensure that we set the type appropriately in both unlink list addition and removal. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06xfs: ensure buffer types are set correctlyDave Chinner
commit 0d612fb570b71ea2e49554a770cff4c489018b2c upstream. Jan Kara reported that log recovery was finding buffers with invalid types in them. This should not happen, and indicates a bug in the logging of buffers. To catch this, add asserts to the buffer formatting code to ensure that the buffer type is in range when the transaction is committed. We don't set a type on buffers being marked stale - they are not going to get replayed, the format item exists only for recovery to be able to prevent replay of the buffer, so the type does not matter. Hence that needs special casing here. Reported-by: Jan Kara <jack@suse.cz> Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-11ext4: prevent bugon on race between write/fcntlDmitry Monakhov
commit a41537e69b4aa43f0fea02498c2595a81267383b upstream. O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked twice inside ext4_file_write_iter() and __generic_file_write() which result in BUG_ON inside ext4_direct_IO. Let's initialize iocb->private unconditionally. TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/ #TYPICAL STACK TRACE: kernel BUG at fs/ext4/inode.c:2960! invalid opcode: 0000 [#1] SMP Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000 RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246 RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818 RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001 RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80 R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28 FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0 Stack: ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30 0000000000000200 0000000000000200 0000000000000001 0000000000000200 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200 Call Trace: [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740 [<ffffffff811be030>] SyS_io_submit+0x10/0x20 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c 24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89 RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP <ffff88080f90bb58> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> [hujianyang: Backported to 3.10 - Move initialization of iocb->private to ext4_file_write() as we don't have ext4_file_write_iter(), which is introduced by commit 9b884164. - Adjust context to make 'overwrite' changes apply to ext4_file_dio_write() as ext4_file_dio_write() is not move into ext4_file_write()] Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-11nilfs2: fix deadlock of segment constructor over I_SYNC flagRyusuke Konishi
commit 7ef3ff2fea8bf5e4a21cef47ad87710a3d0fdb52 upstream. Nilfs2 eventually hangs in a stress test with fsstress program. This issue was caused by the following deadlock over I_SYNC flag between nilfs_segctor_thread() and writeback_sb_inodes(): nilfs_segctor_thread() nilfs_segctor_thread_construct() nilfs_segctor_unlock() nilfs_dispose_list() iput() iput_final() evict() inode_wait_for_writeback() * wait for I_SYNC flag writeback_sb_inodes() * set I_SYNC flag on inode->i_state __writeback_single_inode() do_writepages() nilfs_writepages() nilfs_construct_dsync_segment() nilfs_segctor_sync() * wait for completion of segment constructor inode_sync_complete() * clear I_SYNC flag after __writeback_single_inode() completed writeback_sb_inodes() calls do_writepages() for dirty inodes after setting I_SYNC flag on inode->i_state. do_writepages() in turn calls nilfs_writepages(), which can run segment constructor and wait for its completion. On the other hand, segment constructor calls iput(), which can call evict() and wait for the I_SYNC flag on inode_wait_for_writeback(). Since segment constructor doesn't know when I_SYNC will be set, it cannot know whether iput() will block or not unless inode->i_nlink has a non-zero count. We can prevent evict() from being called in iput() by implementing sop->drop_inode(), but it's not preferable to leave inodes with i_nlink == 0 for long periods because it even defers file truncation and inode deallocation. So, this instead resolves the deadlock by calling iput() asynchronously with a workqueue for inodes with i_nlink == 0. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05pstore/ram: avoid atomic accesses for ioremapped regionsRob Herring
commit 0405a5cec3406f19e69da07c8111a6bf1088ac29 upstream. For persistent RAM outside of main memory, the memory may have limitations on supported accesses. For internal RAM on highbank platform exclusive accesses are not supported and will hang the system. So atomic_cmpxchg cannot be used. This commit uses spinlock protection for buffer size and start updates on ioremapped regions instead. Signed-off-by: Rob Herring <rob.herring@calxeda.com> Acked-by: Anton Vorontsov <anton@enomsg.org> Signed-off-by: Tony Luck <tony.luck@intel.com> [hkp: Backported to 3.10: adjust context] Signed-off-by: HuKeping <hukeping@huawei.com>
2015-02-05pstore: Fix NULL pointer fault if get NULL prz in ramoops_get_next_przLiu ShuoX
commit b0aa931fb84431394d995472d0af2a6c2b61064d upstream. ramoops_get_next_prz get the prz according the paramters. If it get a uninitialized prz, access its members by following persistent_ram_old_size(prz) will cause a NULL pointer crash. Ex: if ftrace_size is 0, fprz will be NULL. Fix it by return NULL in advance. Signed-off-by: Liu ShuoX <shuox.liu@intel.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: HuKeping <hukeping@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05pstore: skip zero size persistent ram buffer in traverseLiu ShuoX
commit aa9a4a1edfbd3d223af01db833da2f07850bc655 upstream. In ramoops_pstore_read, a valid prz pointer with zero size buffer will break traverse of all persistent ram buffers. The latter buffer might be lost. Signed-off-by: Liu ShuoX <shuox.liu@intel.com> Cc: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Cc: Colin Cross <ccross@android.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: HuKeping <hukeping@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05pstore: clarify clearing of _read_cnt in ramoops_contextLiu ShuoX
commit 57fd835385a043577457a385f28c08be693991bf upstream. *_read_cnt in ramoops_context need to be cleared during pstore ->open to support mutli times getting the records. The patch added missed ftrace_read_cnt clearing and removed duplicate clearing in ramoops_probe. Signed-off-by: Liu ShuoX <shuox.liu@intel.com> Cc: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Cc: Colin Cross <ccross@android.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: HuKeping <hukeping@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05pstore: d_alloc_name() doesn't return an ERR_PTRDan Carpenter
commit c39524e6744284452ef45480d3153bec28960c32 upstream. d_alloc_name() returns NULL on error. Also I changed the error code from -ENOSPC to -ENOMEM to reflect that we were short on RAM not disk space. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: HuKeping <hukeping@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05pstore: Fail to unlink if a driver has not defined pstore_eraseAruna Balakrishnaiah
commit bf2883339a33b7544b92ea465b90c3de55082032 upstream. pstore_erase is used to erase the record from the persistent store. So if a driver has not defined pstore_erase callback return -EPERM instead of unlinking a file as deleting the file without erasing its record in persistent store will give a wrong impression to customers. Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: HuKeping <hukeping@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05NFSv4.1: Fix an Oops in nfs41_walk_client_listTrond Myklebust
commit 3175e1dcec40fab1a444c010087f2068b6b04732 upstream. If we start state recovery on a client that failed to initialise correctly, then we are very likely to Oops. Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de> Link: http://lkml.kernel.org/r/130621862.279655.1421851650684.JavaMail.zimbra@desy.de Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05nfs: fix dio deadlock when O_DIRECT flag is flippedPeng Tao
commit ee8a1a8b160a87dc3a9c81a86796aa4db85ea815 upstream. We only support swap file calling nfs_direct_IO. However, application might be able to get to nfs_direct_IO if it toggles O_DIRECT flag during IO and it can deadlock because we grab inode->i_mutex in nfs_file_direct_write(). So return 0 for such case. Then the generic layer will fall back to buffer IO. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29ext4: fix warning in ext4_da_update_reserve_space()Jan Kara
commit 7d7345322d60edb0fa49a64a89b31360f01d09cb upstream. reaim workfile.dbase test easily triggers warning in ext4_da_update_reserve_space(): EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365: ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks with reserved 9 data blocks) The problem is that (one of) tests creates file and then randomly writes to it with O_SYNC. That results in writing back pages of the file in random order so we create extents for written blocks say 0, 2, 4, 6, 8 - this last allocation also allocates new block for extents. Then we writeout block 1 so we have extents 0-2, 4, 6, 8 and we release indirect extent block because extents fit in the inode again. Then we writeout block 10 and we need to allocate indirect extent block again which triggers the warning because we don't have the reservation anymore. Fix the problem by giving back freed metadata blocks resulting from extent merging into inode's reservation pool. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Josh Hunt <johunt@akamai.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29quota: provide interface for readding allocated space into reserved spaceJan Kara
commit 1c8924eb106c1ac755d5d35ce9b3ff42e89e2511 upstream. ext4 needs to convert allocated (metadata) blocks back into blocks reserved for delayed allocation. Add functions into quota code for supporting such operation. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Josh Hunt <johunt@akamai.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27fsnotify: next_i is freed during fsnotify_unmount_inodes.Jerry Hoemann
commit 6424babfd68dd8a83d9c60a5242d27038856599f upstream. During file system stress testing on 3.10 and 3.12 based kernels, the umount command occasionally hung in fsnotify_unmount_inodes in the section of code: spin_lock(&inode->i_lock); if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) { spin_unlock(&inode->i_lock); continue; } As this section of code holds the global inode_sb_list_lock, eventually the system hangs trying to acquire the lock. Multiple crash dumps showed: The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point back at itself. As this is not the value of list upon entry to the function, the kernel never exits the loop. To help narrow down problem, the call to list_del_init in inode_sb_list_del was changed to list_del. This poisons the pointers in the i_sb_list and causes a kernel to panic if it transverse a freed inode. Subsequent stress testing paniced in fsnotify_unmount_inodes at the bottom of the list_for_each_entry_safe loop showing next_i had become free. We believe the root cause of the problem is that next_i is being freed during the window of time that the list_for_each_entry_safe loop temporarily releases inode_sb_list_lock to call fsnotify and fsnotify_inode_delete. The code in fsnotify_unmount_inodes attempts to prevent the freeing of inode and next_i by calling __iget. However, the code doesn't do the __iget call on next_i if i_count == 0 or if i_state & (I_FREEING | I_WILL_FREE) The patch addresses this issue by advancing next_i in the above two cases until we either find a next_i which we can __iget or we reach the end of the list. This makes the handling of next_i more closely match the handling of the variable "inode." The time to reproduce the hang is highly variable (from hours to days.) We ran the stress test on a 3.10 kernel with the proposed patch for a week without failure. During list_for_each_entry_safe, next_i is becoming free causing the loop to never terminate. Advance next_i in those cases where __iget is not done. Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ken Helias <kenhelias@firemail.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27LOCKD: Fix a race when initialising nlmsvc_timeoutTrond Myklebust
commit 06bed7d18c2c07b3e3eeadf4bd357f6e806618cc upstream. This commit fixes a race whereby nlmclnt_init() first starts the lockd daemon, and then calls nlm_bind_host() with the expectation that nlmsvc_timeout has already been initialised. Unfortunately, there is no no synchronisation between lockd() and lockd_up() to guarantee that this is the case. Fix is to move the initialisation of nlmsvc_timeout into lockd_create_svc Fixes: 9a1b6bf818e74 ("LOCKD: Don't call utsname()->nodename...") Cc: Bruce Fields <bfields@fieldses.org> Cc: stable@vger.kernel.org # 3.10.x Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27NFSv4.1: Fix client id trunking on LinuxTrond Myklebust
commit 1fc0703af3143914a389bfa081c7acb09502ed5d upstream. Currently, our trunking code will check for session trunking, but will fail to detect client id trunking. This is a problem, because it means that the client will fail to recognise that the two connections represent shared state, even if they do not permit a shared session. By removing the check for the server minor id, and only checking the major id, we will end up doing the right thing in both cases: we close down the new nfs_client and fall back to using the existing one. Fixes: 05f4c350ee02e ("NFS: Discover NFSv4 server trunking when mounting") Cc: Chuck Lever <chuck.lever@oracle.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16Btrfs: don't delay inode ref updates during log replayChris Mason
commit 6f8960541b1eb6054a642da48daae2320fddba93 upstream. Commit 1d52c78afbb (Btrfs: try not to ENOSPC on log replay) added a check to skip delayed inode updates during log replay because it confuses the enospc code. But the delayed processing will end up ignoring delayed refs from log replay because the inode itself wasn't put through the delayed code. This can end up triggering a warning at commit time: WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34() Which is repeated for each commit because we never process the delayed inode ref update. The fix used here is to change btrfs_delayed_delete_inode_ref to return an error if we're currently in log replay. The caller will do the ref deletion immediately and everything will work properly. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>