Diffstat (limited to 'Documentation/ia64/fsys.txt')
1 files changed, 286 insertions, 0 deletions
diff --git a/Documentation/ia64/fsys.txt b/Documentation/ia64/fsys.txt
new file mode 100644
@@ -0,0 +1,286 @@
+ Light-weight System Calls for IA-64
+ Started: 13-Jan-2003
+ Last update: 27-Sep-2003
+ David Mosberger-Tang
+Using the "epc" instruction effectively introduces a new mode of
+execution to the ia64 linux kernel. We call this mode the
+"fsys-mode". To recap, the normal states of execution are:
+ - kernel mode:
+ Both the register stack and the memory stack have been
+ switched over to kernel memory. The user-level state is saved
+ in a pt-regs structure at the top of the kernel memory stack.
+ - user mode:
+ Both the register stack and the kernel stack are in
+ user memory. The user-level state is contained in the
+ CPU registers.
+ - bank 0 interruption-handling mode:
+ This is the non-interruptible state which all
+ interruption-handlers start execution in. The user-level
+ state remains in the CPU registers and some kernel state may
+ be stored in bank 0 of registers r16-r31.
+In contrast, fsys-mode has the following special properties:
+ - execution is at privilege level 0 (most-privileged)
+ - CPU registers may contain a mixture of user-level and kernel-level
+ state (it is the responsibility of the kernel to ensure that no
+ security-sensitive kernel-level state is leaked back to
+ - execution is interruptible and preemptible (an fsys-mode handler
+ can disable interrupts and avoid all other interruption-sources
+ to avoid preemption)
+ - neither the memory-stack nor the register-stack can be trusted while
+ in fsys-mode (they point to the user-level stacks, which may
+ be invalid, or completely bogus addresses)
+In summary, fsys-mode is much more similar to running in user-mode
+than it is to running in kernel-mode. Of course, given that the
+privilege level is at level 0, this means that fsys-mode requires some
+care (see below).
+* How to tell fsys-mode
+Linux operates in fsys-mode when (a) the privilege level is 0 (most
+privileged) and (b) the stacks have NOT been switched to kernel memory
+yet. For convenience, the header file <asm-ia64/ptrace.h> provides
+The "regs" argument is a pointer to a pt_regs structure. The "task"
+argument is a pointer to the task structure to which the "regs"
+pointer belongs to. user_mode() returns TRUE if the CPU state pointed
+to by "regs" was executing in user mode (privilege level 3).
+user_stack() returns TRUE if the state pointed to by "regs" was
+executing on the user-level stack(s). Finally, fsys_mode() returns
+TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
+The fsys_mode() macro is equivalent to the expression:
+ !user_mode(regs) && user_stack(task,regs)
+* How to write an fsyscall handler
+The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
+(fsyscall_table). This table contains one entry for each system call.
+By default, a system call is handled by fsys_fallback_syscall(). This
+routine takes care of entering (full) kernel mode and calling the
+normal Linux system call handler. For performance-critical system
+calls, it is possible to write a hand-tuned fsyscall_handler. For
+example, fsys.S contains fsys_getpid(), which is a hand-tuned version
+of the getpid() system call.
+The entry and exit-state of an fsyscall handler is as follows:
+** Machine state on entry to fsyscall handler:
+ - r10 = 0
+ - r11 = saved ar.pfs (a user-level value)
+ - r15 = system call number
+ - r16 = "current" task pointer (in normal kernel-mode, this is in r13)
+ - r32-r39 = system call arguments
+ - b6 = return address (a user-level value)
+ - ar.pfs = previous frame-state (a user-level value)
+ - PSR.be = cleared to zero (i.e., little-endian byte order is in effect)
+ - all other registers may contain values passed in from user-mode
+** Required machine state on exit to fsyscall handler:
+ - r11 = saved ar.pfs (as passed into the fsyscall handler)
+ - r15 = system call number (as passed into the fsyscall handler)
+ - r32-r39 = system call arguments (as passed into the fsyscall handler)
+ - b6 = return address (as passed into the fsyscall handler)
+ - ar.pfs = previous frame-state (as passed into the fsyscall handler)
+Fsyscall handlers can execute with very little overhead, but with that
+speed comes a set of restrictions:
+ o Fsyscall-handlers MUST check for any pending work in the flags
+ member of the thread-info structure and if any of the
+ TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
+ doing a full system call (by calling fsys_fallback_syscall).
+ o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
+ r15, b6, and ar.pfs) because they will be needed in case of a
+ system call restart. Of course, all "preserved" registers also
+ must be preserved, in accordance to the normal calling conventions.
+ o Fsyscall-handlers MUST check argument registers for containing a
+ NaT value before using them in any way that could trigger a
+ NaT-consumption fault. If a system call argument is found to
+ contain a NaT value, an fsyscall-handler may return immediately
+ with r8=EINVAL, r10=-1.
+ o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
+ any other operation that would trigger mandatory RSE
+ (register-stack engine) traffic.
+ o Fsyscall-handlers MUST NOT write to any stacked registers because
+ it is not safe to assume that user-level called a handler with the
+ proper number of arguments.
+ o Fsyscall-handlers need to be careful when accessing per-CPU variables:
+ unless proper safe-guards are taken (e.g., interruptions are avoided),
+ execution may be pre-empted and resumed on another CPU at any given
+ o Fsyscall-handlers must be careful not to leak sensitive kernel'
+ information back to user-level. In particular, before returning to
+ user-level, care needs to be taken to clear any scratch registers
+ that could contain sensitive information (note that the current
+ task pointer is not considered sensitive: it's already exposed
+ through ar.k6).
+ o Fsyscall-handlers MUST NOT access user-memory without first
+ validating access-permission (this can be done typically via
+ probe.r.fault and/or probe.w.fault) and without guarding against
+ memory access exceptions (this can be done with the EX() macros
+ defined by asmmacro.h).
+The above restrictions may seem draconian, but remember that it's
+possible to trade off some of the restrictions by paying a slightly
+higher overhead. For example, if an fsyscall-handler could benefit
+from the shadow register bank, it could temporarily disable PSR.i and
+PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
+needed. In other words, following the above rules yields extremely
+fast system call execution (while fully preserving system call
+semantics), but there is also a lot of flexibility in handling more
+* Signal handling
+The delivery of (asynchronous) signals must be delayed until fsys-mode
+is exited. This is acomplished with the help of the lower-privilege
+transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
+checks whether the interrupted task was in fsys-mode and, if so, sets
+PSR.lp and returns immediately. When fsys-mode is exited via the
+"br.ret" instruction that lowers the privilege level, a trap will
+occur. The trap handler clears PSR.lp again and returns immediately.
+The kernel exit path then checks for and delivers any pending signals.
+* PSR Handling
+The "epc" instruction doesn't change the contents of PSR at all. This
+is in contrast to a regular interruption, which clears almost all
+bits. Because of that, some care needs to be taken to ensure things
+work as expected. The following discussion describes how each PSR bit
+PSR.be Cleared when entering fsys-mode. A srlz.d instruction is used
+ to ensure the CPU is in little-endian mode before the first
+ load/store instruction is executed. PSR.be is normally NOT
+ restored upon return from an fsys-mode handler. In other
+ words, user-level code must not rely on PSR.be being preserved
+ across a system call.
+PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers!
+PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers!
+PSR.ic Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
+PSR.i Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
+PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers!
+PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers!
+PSR.db Unchanged. The kernel prevents user-level from setting a hardware
+ breakpoint that triggers at any privilege level other than 3 (user-mode).
+PSR.tb Lazy redirect. If a taken-branch trap occurs while in
+ fsys-mode, the trap-handler modifies the saved machine state
+ such that execution resumes in the gate page at
+ syscall_via_break(), with privilege level 3. Note: the
+ taken branch would occur on the branch invoking the
+ fsyscall-handler, at which point, by definition, a syscall
+ restart is still safe. If the system call number is invalid,
+ the fsys-mode handler will return directly to user-level. This
+ return will trigger a taken-branch trap, but since the trap is
+ taken _after_ restoring the privilege level, the CPU has already
+ left fsys-mode, so no special treatment is needed.
+PSR.cpl Cleared to 0.
+PSR.is Unchanged (guaranteed to be 0 on entry to the gate page).
+PSR.it Unchanged (guaranteed to be 1).
+PSR.id Unchanged. Note: the ia64 linux kernel never sets this bit.
+PSR.da Unchanged. Note: the ia64 linux kernel never sets this bit.
+PSR.dd Unchanged. Note: the ia64 linux kernel never sets this bit.
+PSR.ss Lazy redirect. If set, "epc" will cause a Single Step Trap to
+ be taken. The trap handler then modifies the saved machine
+ state such that execution resumes in the gate page at
+ syscall_via_break(), with privilege level 3.
+PSR.ed Unchanged. Note: This bit could only have an effect if an fsys-mode
+ handler performed a speculative load that gets NaTted. If so, this
+ would be the normal & expected behavior, so no special treatment is
+PSR.bn Unchanged. Note: fsys-mode handlers may clear the bit, if needed.
+ Doing so requires clearing PSR.i and PSR.ic as well.
+PSR.ia Unchanged. Note: the ia64 linux kernel never sets this bit.
+* Using fast system calls
+To use fast system calls, userspace applications need simply call
+__kernel_syscall_via_epc(). For example
+-- example fgettimeofday() call --
+-- fgettimeofday.S --
+.save ar.pfs, r11
+mov r11 = ar.pfs
+mov r2 = 0xa000000000020660;; // gate address
+ // found by inspection of System.map for the
+ // __kernel_syscall_via_epc() function. See
+ // below for how to do this for real.
+mov b7 = r2
+mov r15 = 1087 // gettimeofday syscall
+br.call.sptk.many b6 = b7
+mov ar.pfs = r11
+br.ret.sptk.many rp;; // return to caller
+-- end fgettimeofday.S --
+In reality, getting the gate address is accomplished by two extra
+values passed via the ELF auxiliary vector (include/asm-ia64/elf.h)
+ o AT_SYSINFO : is the address of __kernel_syscall_via_epc()
+ o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO
+The ELF DSO is a pre-linked library that is mapped in by the kernel at
+the gate page. It is a proper ELF shared object so, with a dynamic
+loader that recognises the library, you should be able to make calls to
+the exported functions within it as with any other shared library.
+AT_SYSINFO points into the kernel DSO at the
+__kernel_syscall_via_epc() function for historical reasons (it was
+used before the kernel DSO) and as a convenience.