summaryrefslogtreecommitdiff
path: root/fs/namei.c
AgeCommit message (Collapse)Author
2022-07-06step_into(): move fetching ->d_inode past handle_mounts()Al Viro
... and lose messing with it in __follow_mount_rcu() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06lookup_fast(): don't bother with inodeAl Viro
Note that validation of ->d_seq after ->d_inode fetch is gone, along with fetching of ->d_inode itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06follow_dotdot{,_rcu}(): don't bother with inodeAl Viro
step_into() will fetch it, TYVM. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06step_into(): lose inode argumentAl Viro
make handle_mounts() always fetch it. This is just the first step - the callers of step_into() will stop trying to calculate the sucker, etc. The passed value should be equal to dentry->d_inode in all cases; in RCU mode - fetched after we'd sampled ->d_seq. Might as well fetch it here. We do need to validate ->d_seq, which duplicates the check currently done in lookup_fast(); that duplication will go away shortly. After that change handle_mounts() always ignores the initial value of *inode and always sets it on success. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06namei: stash the sampled ->d_seq into nameidataAl Viro
New field: nd->next_seq. Set to 0 outside of RCU mode, holds the sampled value for the next dentry to be considered. Used instead of an arseload of local variables, arguments, etc. step_into() has lost seq argument; nd->next_seq is used, so dentry passed to it must be the one ->next_seq is about. There are two requirements for RCU pathwalk: 1) it should not give a hard failure (other than -ECHILD) unless non-RCU pathwalk might fail that way given suitable timings. 2) it should not succeed unless non-RCU pathwalk might succeed with the same end location given suitable timings. The use of seq numbers is the way we achieve that. Invariant we want to maintain is: if RCU pathwalk can reach the state with given nd->path, nd->inode and nd->seq after having traversed some part of pathname, it must be possible for non-RCU pathwalk to reach the same nd->path and nd->inode after having traversed the same part of pathname, and observe the nd->path.dentry->d_seq equal to what RCU pathwalk has in nd->seq For transition from parent to child, we sample child's ->d_seq and verify that parent's ->d_seq remains unchanged. Anything that disrupts parent-child relationship would've bumped ->d_seq on both. For transitions from child to parent we sample parent's ->d_seq and verify that child's ->d_seq has not changed. Same reasoning as for the previous case applies. For transition from mountpoint to root of mounted we sample the ->d_seq of root and verify that nobody has touched mount_lock since the beginning of pathwalk. That guarantees that mount we'd found had been there all along, with these mountpoint and root of the mounted. It would be possible for a non-RCU pathwalk to reach the previous state, find the same mount and observe its root at the moment we'd sampled ->d_seq of that For transitions from root of mounted to mountpoint we sample ->d_seq of mountpoint and verify that mount_lock had not been touched since the beginning of pathwalk. The same reasoning as in the previous case applies. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06namei: move clearing LOOKUP_RCU towards rcu_read_unlock()Al Viro
try_to_unlazy()/try_to_unlazy_next() drop LOOKUP_RCU in the very beginning and do rcu_read_unlock() only at the very end. However, nothing done in between even looks at the flag in question; might as well clear it at the same time we unlock. Note that try_to_unlazy_next() used to call legitimize_mnt(), which might drop/regain rcu_read_lock() in some cases. This is no longer true, so we really have rcu_read_lock() held all along until the end. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05switch try_to_unlazy_next() to __legitimize_mnt()Al Viro
The tricky case (__legitimize_mnt() failing after having grabbed a reference) can be trivially dealt with by leaving nd->path.mnt non-NULL, for terminate_walk() to drop it. legitimize_mnt() becomes static after that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05follow_dotdot{,_rcu}(): change calling conventionsAl Viro
Instead of returning NULL when we are in root, just make it return the current position (and set *seqp and *inodep accordingly). That collapses the calls of step_into() in handle_dots() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05namei: get rid of pointless unlikely(read_seqcount_retry(...))Al Viro
read_seqcount_retry() et.al. are inlined and there's enough annotations for compiler to figure out that those are unlikely to return non-zero. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05__follow_mount_rcu(): verify that mount_lock remains unchangedAl Viro
Validate mount_lock seqcount as soon as we cross into mount in RCU mode. Sure, ->mnt_root is pinned and will remain so until we do rcu_read_unlock() anyway, and we will eventually fail to unlazy if the mount_lock had been touched, but we might run into a hard error (e.g. -ENOENT) before trying to unlazy. And it's possible to end up with RCU pathwalk racing with rename() and umount() in a way that would fail with -ENOENT while non-RCU pathwalk would've succeeded with any timings. Once upon a time we hadn't needed that, but analysis had been subtle, brittle and went out of window as soon as RENAME_EXCHANGE had been added. It's narrow, hard to hit and won't get you anything other than stray -ENOENT that could be arranged in much easier way with the same priveleges, but it's a bug all the same. Cc: stable@kernel.org X-sky-is-falling: unlikely Fixes: da1ce0670c14 "vfs: add cross-rename" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-06-04Merge tag 'pull-18-rc1-work.namei' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pathname updates from Al Viro: "Several cleanups in fs/namei.c" * tag 'pull-18-rc1-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: namei: cleanup double word in comment get rid of dead code in legitimize_root() fs/namei.c:reserve_stack(): tidy up the call of try_to_unlazy()
2022-05-30Merge tag 'ovl-update-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: - Support idmapped layers in overlayfs (Christian Brauner) - Add a fix to exportfs that is relevant to open_by_handle_at(2) as well - Introduce new lookup helpers that allow passing mnt_userns into inode_permission() * tag 'ovl-update-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: support idmapped layers ovl: handle idmappings in ovl_xattr_{g,s}et() ovl: handle idmappings in layer open helpers ovl: handle idmappings in ovl_permission() ovl: use ovl_copy_{real,upper}attr() wrappers ovl: store lower path in ovl_inode ovl: handle idmappings for layer lookup ovl: handle idmappings for layer fileattrs ovl: use ovl_path_getxattr() wrapper ovl: use ovl_lookup_upper() wrapper ovl: use ovl_do_notify_change() wrapper ovl: pass layer mnt to ovl_open_realfile() ovl: pass ofs to setattr operations ovl: handle idmappings in creation operations ovl: add ovl_upper_mnt_userns() wrapper ovl: pass ofs to creation operations ovl: use wrappers to all vfs_*xattr() calls exportfs: support idmapped mounts fs: add two trivial lookup helpers
2022-05-27Merge tag 'mm-nonmm-stable-2022-05-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc updates from Andrew Morton: "The non-MM patch queue for this merge window. Not a lot of material this cycle. Many singleton patches against various subsystems. Most notably some maintenance work in ocfs2 and initramfs" * tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (65 commits) kcov: update pos before writing pc in trace function ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock ocfs2: dlmfs: don't clear USER_LOCK_ATTACHED when destroying lock fs/ntfs: remove redundant variable idx fat: remove time truncations in vfat_create/vfat_mkdir fat: report creation time in statx fat: ignore ctime updates, and keep ctime identical to mtime in memory fat: split fat_truncate_time() into separate functions MAINTAINERS: add Muchun as a memcg reviewer proc/sysctl: make protected_* world readable ia64: mca: drop redundant spinlock initialization tty: fix deadlock caused by calling printk() under tty_port->lock relay: remove redundant assignment to pointer buf fs/ntfs3: validate BOOT sectors_per_clusters lib/string_helpers: fix not adding strarray to device's resource list kernel/crash_core.c: remove redundant check of ck_cmdline ELF, uapi: fixup ELF_ST_TYPE definition ipc/mqueue: use get_tree_nodev() in mqueue_get_tree() ipc: update semtimedop() to use hrtimer ipc/sem: remove redundant assignments ...
2022-05-19namei: cleanup double word in commentTom Rix
Remove the second 'to'. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19get rid of dead code in legitimize_root()Al Viro
Combination of LOOKUP_IS_SCOPED and NULL nd->root.mnt is impossible after successful path_init(). All places where ->root.mnt might become NULL do that only if LOOKUP_IS_SCOPED is not there and path_init() itself can return success without setting nd->root only if ND_ROOT_PRESET had been set (in which case nd->root had been set by caller and never changed) or if the name had been a relative one *and* none of the bits in LOOKUP_IS_SCOPED had been present. Since all calls of legitimize_root() must be downstream of successful path_init(), the check for !nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED) is pure paranoia. FWIW, it had been discussed (and agreed upon) with Aleksa back when scoped lookups had been merged; looks like that had fallen through the cracks back then. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19fs/namei.c:reserve_stack(): tidy up the call of try_to_unlazy()Al Viro
!foo() != 0 is a strange way to spell !foo(); fallout from "fs: make unlazy_walk() error handling consistent"... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-13proc/sysctl: make protected_* world readableJulius Hemanth Pitti
protected_* files have 600 permissions which prevents non-superuser from reading them. Container like "AWS greengrass" refuse to launch unless protected_hardlinks and protected_symlinks are set. When containers like these run with "userns-remap" or "--user" mapping container's root to non-superuser on host, they fail to run due to denied read access to these files. As these protections are hardly a secret, and do not possess any security risk, making them world readable. Though above greengrass usecase needs read access to only protected_hardlinks and protected_symlinks files, setting all other protected_* files to 644 to keep consistency. Link: http://lkml.kernel.org/r/20200709235115.56954-1-jpitti@cisco.com Fixes: 800179c9b8a1 ("fs: add link restrictions") Signed-off-by: Julius Hemanth Pitti <jpitti@cisco.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-08namei: Call aops write_begin() and write_end() directlyMatthew Wilcox (Oracle)
pagecache_write_begin() and pagecache_write_end() are now trivial wrappers, so call the aops directly. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08namei: Convert page_symlink() to use memalloc_nofs_save()Matthew Wilcox (Oracle)
Stop using AOP_FLAG_NOFS in favour of the scoped memory API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08namei: Merge page_symlink() and __page_symlink()Matthew Wilcox (Oracle)
There are no callers of __page_symlink() left, so we can remove that entry point. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org>
2022-04-28fs: add two trivial lookup helpersChristian Brauner
Similar to the addition of lookup_one() add a version of lookup_one_unlocked() and lookup_one_positive_unlocked() that take idmapped mounts into account. This is required to port overlay to support idmapped base layers. Cc: <linux-fsdevel@vger.kernel.org> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-04-14VFS: filename_create(): fix incorrect intent.NeilBrown
When asked to create a path ending '/', but which is not to be a directory (LOOKUP_DIRECTORY not set), filename_create() will never try to create the file. If it doesn't exist, -ENOENT is reported. However, it still passes LOOKUP_CREATE|LOOKUP_EXCL to the filesystems ->lookup() function, even though there is no intent to create. This is misleading and can cause incorrect behaviour. If you try ln -s foo /path/dir/ where 'dir' is a directory on an NFS filesystem which is not currently known in the dcache, this will fail with ENOENT. But as the name is not in the dcache, nfs_lookup gets called with LOOKUP_CREATE|LOOKUP_EXCL and so it returns NULL without performing any lookup, with the expectation that a subsequent call to create the target will be made, and the lookup can be combined with the creation. In the case with a trailing '/' and no LOOKUP_DIRECTORY, that call is never made. Instead filename_create() sees that the dentry is not (yet) positive and returns -ENOENT - even though the directory actually exists. So only set LOOKUP_CREATE|LOOKUP_EXCL if there really is an intent to create, and use the absence of these flags to decide if -ENOENT should be returned. Note that filename_parentat() is only interested in LOOKUP_REVAL, so we split that out and store it in 'reval_flag'. __lookup_hash() then gets reval_flag combined with whatever create flags were determined to be needed. Reviewed-by: David Disseldorp <ddiss@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-28Merge tag 'fsnotify_for_v5.17-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fixes from Jan Kara: "Fixes for userspace breakage caused by fsnotify changes ~3 years ago and one fanotify cleanup" * tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix fsnotify hooks in pseudo filesystems fsnotify: invalidate dcache before IN_DELETE event fanotify: remove variable set but not used
2022-01-24fsnotify: invalidate dcache before IN_DELETE eventAmir Goldstein
Apparently, there are some applications that use IN_DELETE event as an invalidation mechanism and expect that if they try to open a file with the name reported with the delete event, that it should not contain the content of the deleted file. Commit 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify will have access to a positive dentry. This allowed a race where opening the deleted file via cached dentry is now possible after receiving the IN_DELETE event. To fix the regression, create a new hook fsnotify_delete() that takes the unlinked inode as an argument and use a helper d_delete_notify() to pin the inode, so we can pass it to fsnotify_delete() after d_delete(). Backporting hint: this regression is from v5.3. Although patch will apply with only trivial conflicts to v5.4 and v5.10, it won't build, because fsnotify_delete() implementation is different in each of those versions (see fsnotify_link()). A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo filesystem that do not need to call d_delete(). Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com Reported-by: Ivan Delalande <colona@arista.com> Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/ Fixes: 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()") Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-01-22fs: move namei sysctls to its own fileLuis Chamberlain
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain. To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic. So move namei's own sysctl knobs to its own file. Other than the move we also avoid initializing two static variables to 0 as this is not needed: * sysctl_protected_symlinks * sysctl_protected_hardlinks Link: https://lkml.kernel.org/r/20211129205548.605569-8-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Stephen Kitt <steve@sk2.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-07vfs, cachefiles: Mark a backing file in use with an inode flagDavid Howells
Use an inode flag, S_KERNEL_FILE, to mark that a backing file is in use by the kernel to prevent cachefiles or other kernel services from interfering with that file. Alter rmdir to reject attempts to remove a directory marked with this flag. This is used by cachefiles to prevent cachefilesd from removing them. Using S_SWAPFILE instead isn't really viable as that has other effects in the I/O paths. Changes ======= ver #3: - Check for the object pointer being NULL in the tracepoints rather than the caller. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819630256.215744.4815885535039369574.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906931596.143852.8642051223094013028.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967141000.1823006.12920680657559677789.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021541207.640689.564689725898537127.stgit@warthog.procyon.org.uk/ # v4
2021-11-01Merge tag 'locks-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "Most of this is just follow-on cleanup work of documentation and comments from the mandatory locking removal in v5.15. The only real functional change is that LOCK_MAND flock() support is also being removed, as it has basically been non-functional since the v2.5 days" * tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove leftover comments from mandatory locking removal locks: remove changelog comments docs: fs: locks.rst: update comment about mandatory file locking Documentation: remove reference to now removed mandatory-locking doc locks: remove LOCK_MAND flock lock support
2021-10-26fs: remove leftover comments from mandatory locking removalJeff Layton
Stragglers from commit f7e33bdbd6d1 ("fs: remove mandatory file locking support"). Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-09-07putname(): IS_ERR_OR_NULL() is wrong hereAl Viro
Mixing NULL and ERR_PTR() just in case is a Bad Idea(tm). For struct filename the former is wrong - failures are reported as ERR_PTR(...), not as NULL. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07namei: Standardize callers of filename_create()Stephen Brennan
filename_create() has two variants, one which drops the caller's reference to filename (filename_create) and one which does not (__filename_create). This can be confusing as it's unusual to drop a caller's reference. Remove filename_create, rename __filename_create to filename_create, and convert all callers. Link: https://lore.kernel.org/linux-fsdevel/f6238254-35bd-7e97-5b27-21050c745874@oracle.com/ Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07namei: Standardize callers of filename_lookup()Stephen Brennan
filename_lookup() has two variants, one which drops the caller's reference to filename (filename_lookup), and one which does not (__filename_lookup). This can be confusing as it's unusual to drop a caller's reference. Remove filename_lookup, rename __filename_lookup to filename_lookup, and convert all callers. The cost is a few slightly longer functions, but the clarity is greater. [AV: consuming a reference is not at all unusual, actually; look at e.g. do_mkdirat(), for example. It's more that we want non-consuming variant for close relative of that function...] Link: https://lore.kernel.org/linux-fsdevel/YS+dstZ3xfcLxhoB@zeniv-ca.linux.org.uk/ Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07rename __filename_parentat() to filename_parentat()Al Viro
... in separate commit, to avoid noise in previous one Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07namei: Fix use after free in kern_path_lockedStephen Brennan
In 0ee50b47532a ("namei: change filename_parentat() calling conventions"), filename_parentat() was made to always call putname() on the filename before returning, and kern_path_locked() was migrated to this calling convention. However, kern_path_locked() uses the "last" parameter to lookup and potentially create a new dentry. The last parameter contains the last component of the path and points within the filename, which was recently freed at the end of filename_parentat(). Thus, when kern_path_locked() calls __lookup_hash(), it is using the filename after it has already been freed. In other words, these calling conventions had been wrong for the only remaining caller of filename_parentat(). Everything else is using __filename_parentat(), which does not drop the reference; so should kern_path_locked(). Switch kern_path_locked() to use of __filename_parentat() and move getting/dropping struct filename into wrapper. Remove filename_parentat(), now that we have no remaining callers. Fixes: 0ee50b47532a ("namei: change filename_parentat() calling conventions") Link: https://lore.kernel.org/linux-fsdevel/YS9D4AlEsaCxLFV0@infradead.org/ Link: https://lore.kernel.org/linux-fsdevel/YS+csMTV2tTXKg3s@zeniv-ca.linux.org.uk/ Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+fb0d60a179096e8c2731@syzkaller.appspotmail.com Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Co-authored-by: Dmitry Kadashev <dkadashev@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03fs, mm: fix race in unlinking swapfileHugh Dickins
We had a recurring situation in which admin procedures setting up swapfiles would race with test preparation clearing away swapfiles; and just occasionally that got stuck on a swapfile "(deleted)" which could never be swapped off. That is not supposed to be possible. 2.6.28 commit f9454548e17c ("don't unlink an active swapfile") admitted that it was leaving a race window open: now close it. may_delete() makes the IS_SWAPFILE check (amongst many others) before inode_lock has been taken on target: now repeat just that simple check in vfs_unlink() and vfs_rename(), after taking inode_lock. Which goes most of the way to fixing the race, but swapon() must also check after it acquires inode_lock, that the file just opened has not already been unlinked. Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com Fixes: f9454548e17c ("don't unlink an active swapfile") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-31Merge tag 'for-5.15-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The highlights of this round are integrations with fs-verity and idmapped mounts, the rest is usual mix of minor improvements, speedups and cleanups. There are some patches outside of btrfs, namely updating some VFS interfaces, all straightforward and acked. Features: - fs-verity support, using standard ioctls, backward compatible with read-only limitation on inodes with previously enabled fs-verity - idmapped mount support - make mount with rescue=ibadroots more tolerant to partially damaged trees - allow raid0 on a single device and raid10 on two devices, degenerate cases but might be useful as an intermediate step during conversion to other profiles - zoned mode block group auto reclaim can be disabled via sysfs knob Performance improvements: - continue readahead of node siblings even if target node is in memory, could speed up full send (on sample test +11%) - batching of delayed items can speed up creating many files - fsync/tree-log speedups - avoid unnecessary work (gains +2% throughput, -2% run time on sample load) - reduced lock contention on renames (on dbench +4% throughput, up to -30% latency) Fixes: - various zoned mode fixes - preemptive flushing threshold tuning, avoid excessive work on almost full filesystems Core: - continued subpage support, preparation for implementing remaining features like compression and defragmentation; with some limitations, write is now enabled on 64K page systems with 4K sectors, still considered experimental - no readahead on compressed reads - inline extents disabled - disabled raid56 profile conversion and mount - improved flushing logic, fixing early ENOSPC on some workloads - inode flags have been internally split to read-only and read-write incompat bit parts, used by fs-verity - new tree items for fs-verity - descriptor item - Merkle tree item - inode operations extended to be namespace-aware - cleanups and refactoring Generic code changes: - fs: new export filemap_fdatawrite_wbc - fs: removed sync_inode - block: bio_trim argument type fixups - vfs: add namespace-aware lookup" * tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits) btrfs: reset replace target device to allocation state on close btrfs: zoned: fix ordered extent boundary calculation btrfs: do not do preemptive flushing if the majority is global rsv btrfs: reduce the preemptive flushing threshold to 90% btrfs: tree-log: check btrfs_lookup_data_extent return value btrfs: avoid unnecessarily logging directories that had no changes btrfs: allow idmapped mount btrfs: handle ACLs on idmapped mounts btrfs: allow idmapped INO_LOOKUP_USER ioctl btrfs: allow idmapped SUBVOL_SETFLAGS ioctl btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids btrfs: allow idmapped SNAP_DESTROY ioctls btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls btrfs: check whether fsgid/fsuid are mapped during subvolume creation btrfs: allow idmapped permission inode op btrfs: allow idmapped setattr inode op btrfs: allow idmapped tmpfile inode op btrfs: allow idmapped symlink inode op btrfs: allow idmapped mkdir inode op ...
2021-08-30Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe: "This adds io_uring support for mkdirat, symlinkat, and linkat" * tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block: io_uring: add support for IORING_OP_LINKAT io_uring: add support for IORING_OP_SYMLINKAT io_uring: add support for IORING_OP_MKDIRAT namei: update do_*() helpers to return ints namei: make do_linkat() take struct filename namei: add getname_uflags() namei: make do_symlinkat() take struct filename namei: make do_mknodat() take struct filename namei: make do_mkdirat() take struct filename namei: change filename_parentat() calling conventions namei: ignore ERR/NULL names in putname()
2021-08-23io_uring: add support for IORING_OP_LINKATDmitry Kadashev
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and arguments. In some internal places 'hardlink' is used instead of 'link' to avoid confusion with the SQE links. Name 'link' conflicts with the existing 'link' member of io_kiocb. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: add support for IORING_OP_SYMLINKATDmitry Kadashev
IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: update do_*() helpers to return intsDmitry Kadashev
Update the following to return int rather than long, for uniformity with the rest of the do_* helpers in namei.c: * do_rmdir() * do_unlinkat() * do_mkdirat() * do_mknodat() * do_symlinkat() Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: make do_linkat() take struct filenameDmitry Kadashev
Pass in the struct filename pointers instead of the user string, for uniformity with do_renameat2, do_unlinkat, do_mknodat, etc. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-8-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: add getname_uflags()Dmitry Kadashev
There are a couple of places where we already open-code the (flags & AT_EMPTY_PATH) check and io_uring will likely add another one in the future. Let's just add a simple helper getname_uflags() that handles this directly and use it. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/ Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: make do_symlinkat() take struct filenameDmitry Kadashev
Pass in the struct filename pointers instead of the user string, for uniformity with the recently converted do_mkdnodat(), do_unlinkat(), do_renameat(), do_mkdirat(). Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-6-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: make do_mknodat() take struct filenameDmitry Kadashev
Pass in the struct filename pointers instead of the user string, for uniformity with the recently converted do_unlinkat(), do_renameat(), do_mkdirat(). Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-5-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: make do_mkdirat() take struct filenameDmitry Kadashev
Pass in the struct filename pointers instead of the user string, and update the three callers to do the same. This is heavily based on commit dbea8d345177 ("fs: make do_renameat2() take struct filename"). This behaves like do_unlinkat() and do_renameat2(). Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: change filename_parentat() calling conventionsDmitry Kadashev
Since commit 5c31b6cedb675 ("namei: saner calling conventions for filename_parentat()") filename_parentat() had the following behavior WRT the passed in struct filename *: * On error the name is consumed (putname() is called on it); * On success the name is returned back as the return value; Now there is a need for filename_create() and filename_lookup() variants that do not consume the passed filename, and following the same "consume the name only on error" semantics is proven to be hard to reason about and result in confusing code. Hence this preparation change splits filename_parentat() into two: one that always consumes the name and another that never consumes the name. This will allow to implement two filename_create() variants in the same way, and is a consistent and hopefully easier to reason about approach. Link: https://lore.kernel.org/io-uring/CAOKbgA7MiqZAq3t-HDCpSGUFfco4hMA9ArAE-74fTpU+EkvKPw@mail.gmail.com/ Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Link: https://lore.kernel.org/r/20210708063447.3556403-3-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: ignore ERR/NULL names in putname()Dmitry Kadashev
Supporting ERR/NULL names in putname() makes callers code cleaner, and is what some other path walking functions already support for the same reason. This also removes a few existing IS_ERR checks before putname(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/CAHk-=wgCac9hBsYzKMpHk0EbLgQaXR=OUAjHaBtaY+G8A9KhFg@mail.gmail.com/ Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Link: https://lore.kernel.org/r/20210708063447.3556403-2-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: add mapping aware lookup helperChristian Brauner
Various filesystems rely on the lookup_one_len() helper to lookup a single path component relative to a well-known starting point. Allow such filesystems to support idmapped mounts by adding a version of this helper to take the idmap into account when calling inode_permission(). This change is a required to let btrfs (and other filesystems) support idmapped mounts. Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23fs: remove mandatory file locking supportJeff Layton
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it off in Fedora and RHEL8. Several other distros have followed suit. I've heard of one problem in all that time: Someone migrated from an older distro that supported "-o mand" to one that didn't, and the host had a fstab entry with "mand" in it which broke on reboot. They didn't actually _use_ mandatory locking so they just removed the mount option and moved on. This patch rips out mandatory locking support wholesale from the kernel, along with the Kconfig option and the Documentation file. It also changes the mount code to ignore the "mand" mount option instead of erroring out, and to throw a big, ugly warning. Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-07-03Merge branch 'work.namei' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs name lookup updates from Al Viro: "Small namei.c patch series, mostly to simplify the rules for nameidata state. It's actually from the previous cycle - but I didn't post it for review in time... Changes visible outside of fs/namei.c: file_open_root() calling conventions change, some freed bits in LOOKUP_... space" * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: namei: make sure nd->depth is always valid teach set_nameidata() to handle setting the root as well take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space switch file_open_root() to struct path