Introduction Link to heading
Have you ever wondered why certain programs run on Windows but not on Linux ? After all a binary program corresponds to instructions in form of machine code which is dependent to processor architecture.
At first glance, this might seem confusing because a binary program is simply a collection of machine code instructions, which are dependent on the processor’s architecture. However, when you take a Windows executable file (.exe) and try to run it on a Linux system, it doesn’t work directly. So, what’s the reason behind this? The key lies in System Calls.
A system call (or syscall) is a mechanism that allows a program running in the user space (or userland) to request services from the kernel which will interact with the hardware. Whenever your program deal with a file, listen for network connection, allocate memory, it relies on system calls to request these services from the kernel.
Every operating system has its own set of system calls, each associated with a specific number in the syscall table. This table is tailored to the architecture of the processor, for example, the syscall table for x86_64 is different from that of ARM-based processors.
When you try to run a Windows executable on Linux, the two operating systems have different syscall table and interpretation, and Linux simply doesn’t recognize the system calls designed for Windows. That’s why the program won’t run directly.
In this article, we will take a look on implementation of theses syscall within the linux kernel.
First look at linux kernel system call Link to heading
There is many system calls, each syscalls are associated with a number in the syscall table, note that each processor arch has his own syscall table, for example, the syscall table for x86_64 can be found here on the linux kernel source code.
To illustrate, let’s take a simple C program:
#include <stdio.h>
int main(void){
char UserInput[25];
FILE *fp;
fp=fopen("data.txt","w+");
printf("text input:\n");
fgets(UserInput,25,stdin);
fprintf(fp,"%s",UserInput);
fclose(fp);
return 0;
}
This program simply open a text file (data.txt) and ask user to type content which will be put into this file.
We can use using strace command-line utility to retrieve all syscalls used by this program.
<redacted>
openat(AT_FDCWD, "data.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
write(1, "text input:\n", 12text input:) = 12
newfstatat(0, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
read(0, t"t\n", 1024) = 2
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_EMPTY_PATH) = 0
write(3, "t\n", 2) = 2
close(3) = 0
exit_group(0) = ?
By looking on it, you can see the following system calls used:
- openat to open the text file
- write to print content on the terminal
- read to the content the user put into the program
- close to exit program
CPU privilege level and context switch Link to heading
Users program such as terminal, web browser or text editor needs to communicate with linux kernel to do a set of operations that they cannot do themselves. For example, these operation can be related to file I/O (read,write,open) or related to networking stuff (socket,bind,connect).
You may be wondering why does prevent programs in userspace to execute these operation themselves ? Most modern processors use a ring-based privilege levels
Ring 0 (Kernel Mode): Hightest privilege level, this is where the operating system kernel runs, allowing direct access to hardware and memory management. Ring 1 and Ring 2: Generally unused in most modern operating systems but can be reserved for drivers or other privileged services such as virtualbox guest kernel. Ring 3 (User Mode): Lowest privilege level, this is where user program run and they don’t have direct access to hardware or critical memory areas.
There is many reasons for using this ring-based privilege :
- System stability and integrity: incorrect CPU instructions can crash the system or corrupt files or event filesystem.
- Security : Malicious program could use hardware features to gain kernel-level access or will be free to modify sensitive data such as password and encryption keys.
Typically, when user’s programs perform a system call, they will switch from user mode to kernel mode, at the CPU privilege level, thats mean transition from Ring 3 to ring 0, this operation is a partial cpu context switch called mode switch.
Modern processors have multiple cores (which you can think of as having multiple CPUs), and each core can have processes allocated to it. Each core uses a technique called concurrency to handle multiple processes, which involves allocating time and resources to each program, then switching to another one. This switching happens so quickly that it appears as if the programs are running simultaneously, even though the CPU is switching between them very rapidly. The switches between these processes involve a full context switch, during which the entire state of a process is saved including CPU registers, flags, and other relevant information into the Process Control Block (PCB). Each process has its own corresponding PCB that holds all the necessary information to restore the process state when it’s scheduled to run again. However, mode switch does not requires full context switch . Instead, it simply saves and restores the minimum necessary state, such as the program counter and flags related to the mode change. This is because the process is still the same and it’s not being preempted for execution, the only thing that changes is the level of privilege and the operations it can perform which is generally saved into specific registers.
Invoke a linux system call Link to heading
The methods for invoking system calls have evolved from legacy approaches, like using software interrupts, to faster techniques using dedicated CPU instructions, in this part we will talk about these different system calls method.
Legacy syscall Link to heading
Legacy system call is an older method for requesting services from the operating system’s kernel, kept for retro-compatibility purposes .
This method works by using a software interrupt which belong to int
assembly instruction, to enter in kernel mode (ring 0) and execute the system call.
An interrupt is an event raised by the hardware or software when it require the cpu attention. Interrupts takes as an argument a special number, in order to perform a system call, this number is 0x80 (128) as seen at x86/include/asm/irq_vectors.h on the linux kernel source code.
That being said, one piece of the puzzle is missing, how did we know which syscall will be used ?
In the introduction, i mentioned the syscall table, now it’s time for it to come into play.
Did you remember that every syscall is assigned to a number, in order to call a specific syscall we have to put this value into the eax
register.
System call often require arguments, they takes up to 6 arguments from EBX
, ECX
, EDX
, ESI
, EDI
, EBP
See here
The syscall table used in this section can be found here
ebx
is related to arg1, ecx
to arg2 …There is an example bellow for a simple hello world program using legacy syscall.
section .data
msg db "Hello World!" ; Printed message
len equ $ - msg ; Message length
section .text
global _start
_start:
; sys_write(stdout, msg, len)
mov eax, 4 ; sys_write syscall number: 4
mov ebx, 1 ; ebx: arg1, 1 = stdout
mov ecx, msg ; ecx: arg2
mov edx, len ; edx,arg3
int 0x80 ; Trigger system call
; sys_exit(0)
mov eax, 1 ; sys_exit syscall number : 1
xor ebx, ebx ; Set ebx to 0 => exit number
int 0x80 ; Trigger system call
You can build and run this piece of code it using following commands if your processor support it (x86):
nasm -f elf32 helloworld.asm -o helloworld.o
ld -m elf_i386 helloworld.o -o helloworld
./helloworld
After executing the syscall, how does the kernel return back to the user program and drop the privilege ? This is done by performing a kernel mode to user mode mode switch.
When a system call is made using a software interrupt, there is underlying mechanism which makes the CPU automatically saves important information such as the return address, CPU flags, and other state details on the stack. This saved context allows the kernel to know exactly where to resume execution in the user program once the system call is finished. To return to user space, the kernel restores the saved context by copying these values from the kernel stack back into the CPU registers. This includes restoring the return address and CPU flags, which allows the program to continue executing exactly where it left off, now back in user space.
One of the most common ways for the kernel to return to the user program after a system call is by using the iret
(Interrupt Return) instruction which is specially designed to restore the CPU state saved during the system call.
To be more specific, it pops the return address (the location in the user program where execution should resume), the code segment selector, which determines the privilege level and the CPU flag.
Fast system calls Link to heading
Unlike The traditional approach of legacy syscalls appears to be quite practical, modern techniques exist for invoking a system call without relying on a software interrupt and significantly outperform software interrupts in terms of speed.
These modern techniques are called Fast system call and relies on a pair of dedicated CPU instruction : one for transition into the kernel and other for exiting it.
32 bits systems use sysenter
and sysexit
64 bits systems uses syscall
and sysret
32 bits systems Link to heading
Instead of using a software interrupt (int 0x80
) to switch to kernel mode, fast system calls use the sysenter
instruction. This instruction is hardwired in modern CPUs to directly transition from user mode (ring 3) to kernel mode (ring 0) with minimal overhead.
To return from the system call, the kernel uses the sysexit
instruction, which efficiently switches back to user mode.
This is possible because sysenter and sysexit are paired instructions, optimized to work together. They do not require the CPU to perform the extensive privilege checks and interrupt vector lookups associated with software interrupts.
Before using sysenter, the kernel must configure different Model-Specific Registers (MSR), see Instruction Set reference for more details. The MSR are control registers that have a specific purpose to control certain features of the cpu, MSR list can be found here.
As the name suggests, MSRs (Model-Specific Registers) are related to specific CPU families or models. However, some of them are present as a baseline across all ia32 processors, ensuring common functionality, like the 3 MSR realted to sysenter as seen on the link to the instruction set reference just above. If you are interested to specific MSR, refer to specific CPU model documentation.
Like legacy system calls, there is also e convention on where the system call number and argument can be placed (reference).
The system call number is placed in the eax
register to specify the desired system call.
Arguments are passed in the following registers: ebx
, ecx
, edx
, esi
, edi
, and ebp
, in the same order as the legacy system call convention.
The syscall table used for sysenter on 32-bit x86 can be found here.
After executing the system call, the kernel needs to return to the user program and drop the privilege level. This is doe using the sysexit
instruction.
Unlike the legacy method that relies on the iret instruction to pop values off the stack and restore the CPU state, sysexit is specifically optimized for fast system call exits. It directly restores the following:
- EIP from EDX: The instruction pointer in user mode.
- ESP from ECX: The stack pointer in user mode.
- CS and SS segments are set to predefined values suitable for user mode execution.
This means the kernel must prepare rdx
and rcx
with the return address and stack pointer BEFORE executing sysexit.
64 bits systems Link to heading
On 64 bits system, its much more simpler and looks more like what we saw in the legacy system calls section.
This part involves using syscall
and sysret
instructions.
For the kernel to receive incoming system calls, it must register the address of the code that will execute when a system call occurs by writing its address to the MSR_LSTAR msr .
There is a defined convention for making system calls with syscall instruction(reference).
The userland program is expected to put the system call number to be in the rax
register. The arguments to the syscall are expected to be placed in a subset of the general purpose registers precisely rdi
, rsi
, rdx
, r10
, r8
and r9
In order to return from a syscall, the instruction sysret
is defined for this purpose.
In order ot use that instruction, simply copy the address where the execution should resume into the rcx
register when syscall instruction is used.
sysret
glibc wrapper Link to heading
The simplest way to invoke a system call is by using glibc which is a library in Linux that provides standard functions to help your programs interact with the operating system. The advantage of using glibc is that it performs important tasks behind the scenes before and after calling syscalls, that why calling syscall directly in your own assembly is not a good idea.
there is an example of c code using a glibc wrapper:
#include <unistd.h>
int main(void){
write(1,"hello\n",6);
return 0;
}
You can also call write() using syscall() wrapper from glibc
#include <unistd.h>
int main(void){
int syscall_number=1; //syscall number for write() = 0x1
syscall(syscall_number,1,"hello\n",6);
return 0;
}
Virtual dynamic shared object (VDSO) Link to heading
At the beginning of this article, i said in order to perform a system call, there is a mode switch from a usermode to kernel mode. However, a large amount of this operation can be expensive in term of cpu resources. For that reason, linux kernel introducing Virtual dynamic shared object, VDSO.
The linux VDSO is a set of code that is part of the kernel, but is mapped into the address space of a user program to be run in userland. This allows to run some syscall WITHOUT entering into the kernel.
One of these is time()
syscall , program calling this syscall do not enter in kernel mode, instead they just do a function call to a piece of code provided by the kernel but run in userland.
conclusion Link to heading
While system calls are fundamental to understand how programs interact with the operating system, the Linux syscall interface is complex. Each syscall is uniquely linked to specific tasks like file IO, memory management, networking, and much more. Additionally, the syscall table varies depending on the processor architecture, which add even more complexity.
References Link to heading
https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html https://0xax.gitbooks.io/linux-insides/content/SysCall/ https://blog.packagecloud.io/the-definitive-guide-to-linux-system-calls/ https://eclypsium.com/blog/direct-memory-access-attacks-a-walk-down-memory-lane/