Gramine features
⚠ This is a highly technical document intended for software engineers with knowledge of OS kernels.
⛏ This is a living document. The last major update happened in November 2023 and closely corresponds to Gramine v1.6.
Gramine strives to run native, unmodified Linux applications on any platform. The SGX backend additionally strives to provide security guarantees, in particular, protect against a malicious host OS.
Gramine intercepts all application requests to the host OS. Some of these requests are processed entirely inside Gramine, and some are funneled through a thin API to the host OS. Either way, each application’s request and each host’s reply are verified for correctness and consistency. For these verifications, Gramine maintains internal, “shadow” state. Thus, Gramine defends against Iago attacks.
Gramine strives to be 100% compatible with the Linux kernel, even when it deviates from standards like POSIX (“bug-for-bug compatibility”). At the same time, Gramine is minimalistic, and implements only the most important subset of Linux functionality, enough to run portable, hardware-independent applications.
Gramine currently has two backends: execution on the host Linux OS (called gramine-direct) and execution inside an Intel SGX enclave (called gramine-sgx). If some feature has quirks and peculiarities in some backend, we describe it explicitly. More backends are possible in the future.
Features implemented in Gramine can be classified as:
Linux features: features can be (1) implemented, (2) partially implemented, or (3) not implemented at all in Gramine. If the feature is partially implemented, then we also document the parts that are implemented and the parts that are not implemented. If the feature is not implemented at all, we also specify whether there are plans to implement it in the future (and if not, the rationale why not).
Some features are not implemented by design: either they increase the Trusted Computing Base (TCB) of Gramine disproportionately, or they cannot be implemented securely.
Other features are not implemented because they are unused: some Linux features are deprecated or ill-conceived, and applications do not use them (or have fallbacks when these features are not detected).
Gramine-specific features: additional features, e.g., attestation primitives. Note that this document covers only APIs exposed to applications (like additional system calls and pseudo-files) and doesn’t cover Gramine features transparent to the app (exitless, ASLR, debugging, etc.).
Each feature has a list of related system calls and pseudo-files, for cross-reference.
Table of Contents (abridged)
Terminology
Similarly to Linux, Gramine provides two interfaces to user applications:
Linux userspace-to-kernel interface, consisting of two sub-interfaces:
Linux System Call Interface: a set of system calls which allow applications to access system resources and services. Examples: open(), fork(), gettimeofday().
Pseudo filesystems: a set of special directories with file contents containing information about the Gramine instance, system resources, hardware configuration, etc. These filesystems are generated on the fly upon Gramine startup. Examples: /proc/cpuinfo, /dev/attestation/quote.
Linux kernel-to-userspace interface, in particular, two standards:
System V ABI: defines how applications invoke system calls and receive signals.
Executable and Linking Format (ELF): defines how applications are loaded from binary files.
Legend:
☑ implemented (no serious limitations)
▣ partially implemented (serious limitations or quirks)
☒ not implemented
List of system calls
Gramine implements ~170 system calls out of ~360 system calls available on Linux. Many system calls are implemented only partially, typically because real world workloads do not use the unimplemented functionality (for example, O_ASYNC flag in open() is not used widely). Some system calls are not implemented because they are deprecated in Linux, because they are unused by real world applications or because they don’t fit the purpose of Gramine (“virtualize a single application”).
The list of implemented system calls grows with time, as Gramine adds functionality required by real world workloads.
The list below is generated from the syscall table of Linux 6.0. The number after each system call refers to the section of this document that covers it.
Status of system call support in Gramine
▣ open() 9a
▣ stat() 9a
▣ lstat() 9a
▣ lseek() 9a
▣ mprotect() 6
☑ munmap() 6
☑ rt_sigaction() 7
☑ rt_sigprocmask() 7
☑ rt_sigreturn() 7
☑ pread64() 9a
☑ pwrite64() 9a
▣ access() 9a
☑ pipe() 10
☑ sched_yield() 4
☒ mremap() 6
▣ mincore() 6
▣ madvise() 6
☒ shmget() 17
☒ shmat() 17
☒ shmctl() 17
☑ dup() 23
☑ dup2() 23
☑ pause() 7
☑ nanosleep() 20
▣ getitimer() 20
☑ alarm() 20
▣ setitimer() 20
☑ getpid() 3
☑ socketpair() 11b
☑ fork() 1
☑ vfork() 1
☑ execve() 1
▣ wait4() 7
▣ kill() 7
▣ uname() 22
☒ semget() 15
☒ semop() 15
☒ semctl() 15
☒ shmdt() 17
☒ msgget() 16
☒ msgsnd() 16
☒ msgrcv() 16
☒ msgctl() 16
▣ flock() 9b
☑ fsync() 9a
☑ fdatasync() 9a
☑ truncate() 9a
☑ ftruncate() 9a
☑ getdents() 9a
☑ getcwd() 9a
☑ chdir() 9a
☑ fchdir() 9a
▣ rename() 9a
☑ mkdir() 9a
☑ rmdir() 9a
☑ creat() 9a
☒ link() 9d
☑ unlink() 9a
☒ symlink() 9d
▣ readlink() 9d
☑ chmod() 9a
☑ fchmod() 9a
▣ chown() 9a
▣ fchown() 9a
☒ lchown() 9d
☑ umask() 9a
▣ getrlimit() 22
☒ getrusage() 22
▣ sysinfo() 22
☒ times() 19
☒ ptrace() 24
▣ getuid() 8
☒ syslog() 24
▣ getgid() 8
▣ setuid() 8
▣ setgid() 8
▣ geteuid() 8
▣ getegid() 8
▣ setpgid() 8
☑ getppid() 3
▣ getpgrp() 3
▣ setsid() 23
☒ setreuid() 8
☒ setregid() 8
▣ getgroups() 8
▣ setgroups() 8
☒ setresuid() 8
☒ getresuid() 8
☒ setresgid() 8
☒ getresgid() 8
▣ getpgid() 3
☒ setfsuid() 8
☒ setfsgid() 8
▣ getsid() 23
☒ capget() 24
☒ capset() 24
☑ rt_sigpending() 7
☑ rt_sigtimedwait() 7
☒ rt_sigqueueinfo() 7
☑ rt_sigsuspend() 7
☑ sigaltstack() 7
☒ utime() 9a
▣ mknod() 10
☒ uselib() 24
☒ personality() 24
☒ ustat() 9a
▣ statfs() 9a
▣ fstatfs() 9a
☒ sysfs() 9a
▣ getpriority() 4
▣ setpriority() 4
▣ sched_setparam() 4
▣ sched_getparam() 4
▣ sched_setscheduler() 4
▣ sched_getscheduler() 4
▣ sched_get_priority_max() 4
▣ sched_get_priority_min() 4
▣ sched_rr_get_interval() 4
▣ mlock() 6
▣ munlock() 6
▣ mlockall() 6
▣ munlockall() 6
☒ vhangup() 24
☒ modify_ldt() 24
☒ pivot_root() 9a
☒ _sysctl() 24
☒ prctl() 2
▣ arch_prctl() 2
☒ adjtimex() 19
▣ setrlimit() 22
☑ chroot() 9a
☒ sync() 9a
☒ acct() 24
☒ settimeofday() 19
☒ mount() 9a
☒ umount2() 9a
☒ swapon() 24
☒ swapoff() 24
☒ reboot() 24
▣ sethostname() 22
▣ setdomainname() 22
☒ iopl() 24
☒ ioperm() 24
☒ create_module() 24
☒ init_module() 24
☒ delete_module() 24
☒ get_kernel_syms() 24
☒ query_module() 24
☒ quotactl() 24
☒ nfsservctl() 24
☒ getpmsg() 24
☒ putpmsg() 24
☒ afs_syscall() 24
☒ tuxcall() 24
☒ security() 24
☑ gettid() 3
☒ readahead() 24
☒ setxattr() 9a
☒ lsetxattr() 9a
☒ fsetxattr() 9a
☑ getxattr() 9a
☑ lgetxattr() 9a
☑ fgetxattr() 9a
☑ listxattr() 9a
☑ llistxattr() 9a
☑ flistxattr() 9a
☒ removexattr() 9a
☒ lremovexattr() 9a
☒ fremovexattr() 9a
▣ tkill() 7
☑ time() 19
▣ futex() 5
☑ sched_setaffinity() 4
☑ sched_getaffinity() 4
☒ set_thread_area() 2
☒ io_setup() 13
☒ io_destroy() 13
☒ io_getevents() 13
☒ io_submit() 13
☒ io_cancel() 13
☒ get_thread_area() 2
☒ lookup_dcookie() 24
☒ remap_file_pages() 6
☑ getdents64() 9a
☑ set_tid_address() 3
☒ restart_syscall() 24
☒ semtimedop() 15
▣ fadvise64() 9a
☒ timer_create() 20
☒ timer_settime() 20
☒ timer_gettime() 20
☒ timer_getoverrun() 20
☒ timer_delete() 20
☒ clock_settime() 19
▣ clock_gettime() 19
▣ clock_getres() 19
▣ clock_nanosleep() 20
☑ exit_group() 1
☑ tgkill() 7
☒ utimes() 9a
☒ vserver() 24
▣ mbind() 6
☒ set_mempolicy() 6
☒ get_mempolicy() 6
☒ mq_open() 16
☒ mq_unlink() 16
☒ mq_timedsend() 16
☒ mq_timedreceive() 16
☒ mq_notify() 16
☒ mq_getsetattr() 16
☒ kexec_load() 24
▣ waitid() 7
☒ add_key() 24
☒ request_key() 24
☒ keyctl() 24
☒ ioprio_set() 4
☒ ioprio_get() 4
☒ inotify_init() 9c
☒ inotify_add_watch() 9c
☒ inotify_rm_watch() 9c
☒ migrate_pages() 6
▣ openat() 9a
☑ mkdirat() 9a
▣ mknodat() 10
▣ fchownat() 9a
☒ futimesat() 9a
▣ newfstatat() 9a
☑ unlinkat() 9a
▣ renameat() 9a
☒ linkat() 9d
☒ symlinkat() 9d
▣ readlinkat() 9d
☑ fchmodat() 9a
▣ faccessat() 9a
☑ set_robust_list() 5
☑ get_robust_list() 5
☒ splice() 24
☒ tee() 24
☒ sync_file_range() 9a
☒ vmsplice() 24
☒ move_pages() 6
☒ utimensat() 9a
☒ signalfd() 7
☒ timerfd_create() 20
▣ eventfd() 14
▣ fallocate() 9a
☒ timerfd_settime() 20
☒ timerfd_gettime() 20
☒ signalfd4() 7
▣ eventfd2() 14
☑ dup3() 23
▣ pipe2() 10
☒ inotify_init1() 9c
☑ preadv() 9a
☑ pwritev() 9a
☒ rt_tgsigqueueinfo() 7
☒ perf_event_open() 24
☒ fanotify_init() 9c
☒ fanotify_mark() 9c
▣ prlimit64() 22
☒ name_to_handle_at() 9a
☒ open_by_handle_at() 9a
☒ clock_adjtime() 19
☒ syncfs() 9a
☒ setns() 24
☒ process_vm_readv() 24
☒ process_vm_writev() 24
☒ kcmp() 1
☒ finit_module() 24
☒ sched_setattr() 4
☒ sched_getattr() 4
☒ renameat2() 9a
☒ seccomp() 24
☑ getrandom() 21
☒ memfd_create() 6
☒ kexec_file_load() 24
☒ bpf() 24
☒ execveat() 1
☒ userfaultfd() 7
☒ membarrier() 6
▣ mlock2() 6
☒ copy_file_range() 24
☒ preadv2() 9a
☒ pwritev2() 9a
☒ pkey_mprotect() 24
☒ pkey_alloc() 24
☒ pkey_free() 24
☒ statx() 9a
☒ io_pgetevents() 24
☒ rseq() 24
☒ pidfd_send_signal() 7
☒ io_uring_setup() 13
☒ io_uring_enter() 13
☒ io_uring_register() 13
☒ open_tree() 24
☒ move_mount() 9a
☒ fsopen() 24
☒ fsconfig() 24
☒ fsmount() 24
☒ fspick() 24
☒ pidfd_open() 7
☑ close_range() 9a
☒ openat2() 9a
☒ pidfd_getfd() 7
☒ faccessat2() 9a
☒ mount_setattr() 9a
☒ quotactl_fd() 24
☒ landlock_create_ruleset() 24
☒ landlock_add_rule() 24
☒ landlock_restrict_self() 24
☒ memfd_secret() 6
☒ futex_waitv() 5
☒ set_mempolicy_home_node() 6
List of pseudo-files
Gramine partially emulates Linux pseudo-filesystems: /dev, /proc and /sys.
Only a subset of the most widely used pseudo-files is implemented. The list of implemented pseudo-files grows with time, as Gramine adds functionality required by real-world workloads.
List of all pseudo-files in Gramine
☑ /dev/attestation/ 25
☑ /dev/null 23
☑ /dev/zero 23
☑ /dev/random 21
☑ /dev/urandom 21
☑ /dev/stdin 9d
☑ /dev/stdout 9d
☑ /dev/stderr 9d
▣ /sys/devices/system/ 22
▣ /sys/devices/system/cpu/ 22
▣ /sys/devices/system/cpu/cpu[x]/ 22
▣ /sys/devices/system/cpu/cpu[x]/cache/index[x]/ 22
☑ /sys/devices/system/cpu/cpu[x]/cache/index[x]/coherency_line_size 22
☑ /sys/devices/system/cpu/cpu[x]/cache/index[x]/level 22
☑ /sys/devices/system/cpu/cpu[x]/cache/index[x]/number_of_sets 22
☑ /sys/devices/system/cpu/cpu[x]/cache/index[x]/physical_line_partition 22
☑ /sys/devices/system/cpu/cpu[x]/cache/index[x]/shared_cpu_map 22
☑ /sys/devices/system/cpu/cpu[x]/cache/index[x]/size 22
☑ /sys/devices/system/cpu/cpu[x]/cache/index[x]/type 22
☑ /sys/devices/system/cpu/cpu[x]/online 22
▣ /sys/devices/system/cpu/cpu[x]/topology/ 22
☑ /sys/devices/system/cpu/kernel_max 22
☑ /sys/devices/system/cpu/offline 22
☑ /sys/devices/system/cpu/online 22
☑ /sys/devices/system/cpu/possible 22
☑ /sys/devices/system/cpu/present 22
▣ /sys/devices/system/node/ 22
▣ /sys/devices/system/node/node[x]/ 22
Linux features
Processes
Gramine supports multi-processing. A Gramine instance starts the first (main) process, as specified in the entrypoint of the manifest. The first process can spawn child processes, which belong to the same Gramine instance.
Gramine can execute ELF binaries (executables and libraries) and executable scripts. Gramine supports executing them as entrypoints and via the execve() system call. In case of SGX backend, execve() execution replaces a calling program with a new program in the same SGX enclave.
Gramine supports creating child processes using fork(), vfork() and clone() system calls. vfork() is emulated via fork(). When used for process creation, clone() always means a separate process with its own address space (i.e., CLONE_THREAD, CLONE_FILES, etc. flags cannot be specified). In case of SGX backend, child processes are created in a new SGX enclave.
It is possible to disallow creation of child processes, by specifying sys.disallow_subprocesses = true in the manifest. The intuition is that many applications have fallbacks when they fail to spawn a child process (e.g. Python). This can be useful in SGX environments: child processes consume EPC memory which is a limited resource.
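The kind of application-side fallback mentioned above can be sketched as follows. This is a minimal illustration, not code from Gramine or from any particular application: the helper name run_isolated is hypothetical, and it assumes that a disallowed spawn surfaces to Python as an OSError raised by os.fork().

```python
import os


def run_isolated(task, fork=os.fork):
    """Run task() in a child process; fall back to the current process.

    `task` returns an int exit code (0-255). `fork` is injectable only so
    that the fallback path can be exercised without a restricted environment.
    Returns (where, code), where `where` is "child" or "inline".
    """
    try:
        pid = fork()
    except OSError:
        # Spawning failed (assumption: this is how a disallowed subprocess
        # shows up in the application), so do the work in-process instead.
        return "inline", task()

    if pid == 0:
        # Child: run the task and exit immediately, skipping the normal
        # interpreter shutdown so the parent's buffers are not flushed twice.
        code = 1
        try:
            code = task()
        finally:
            os._exit(code)

    # Parent: reap the child and translate the wait status to an exit code.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return "child", os.WEXITSTATUS(status)
    return "child", -os.WTERMSIG(status)
```

With this shape, the application behaves the same whether or not child processes are available; only the isolation between the task and the caller differs.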
Currently, Gramine does not fully support fork in multi-threaded applications. There is a known bug in Gramine that if one thread is performing fork and another thread modifies the internal Gramine state, the state may get corrupted (which may lead to failures).
Gramine supports process termination using exit() and exit_group() system calls. If there are child processes executing and the first process exits, Gramine currently does not kill child processes; this is however not a problem in practice because the host OS cleans up these orphaned children.
All aforementioned system calls follow Linux semantics, barring the mentioned peculiarities. However, properties of processes not supported by Gramine (e.g. namespaces, pidfd, etc.) are ignored.
Gramine does not support disassociating parts of the process execution context (via unshare() system call). Gramine does not support comparing two processes (via kcmp()).
Related system calls
☑ execve()
☒ execveat(): very rarely used by applications
☑ clone(): except exotic combination CLONE_VM & !CLONE_THREAD & !CLONE_VFORK
☑ fork()
☑ vfork(): with the same semantics as fork()
☑ exit()
☑ exit_group()
☒ clone3(): very rarely used by applications
☒ unshare(): very rarely used by applications
☒ kcmp(): very rarely used by applications
Additional materials
The LD_LIBRARY_PATH environment variable is always propagated into new processes, see the issue.
Threads
Gramine implements multi-threading. In case of SGX backend, all threads of one Gramine process run in the same SGX enclave.
Gramine implements per-thread:
information about signal (alternate) stack,
user/group IDs,
thread groups info,
signal mask, signal dispositions, signal queue,
futex robust list,
CPU affinity mask.
Gramine supports creating threads using clone(.. CLONE_VM | CLONE_THREAD ..) system call and destroying threads using exit() system call.
Gramine does not support manipulations of thread-local storage information (via get_thread_area() and set_thread_area() system calls). Instead, Gramine supports setting arch-specific (x86-specific) thread state via arch_prctl(ARCH_GET_FS) and arch_prctl(ARCH_SET_FS). Note that Gramine does not allow arch_prctl(ARCH_GET_GS) and arch_prctl(ARCH_SET_GS) – the GS register is reserved for Gramine internal usage.
Note on thread's stack size
Gramine sets the same stack size for each thread. Gramine does not support dynamic growth of the first-thread stack (as Linux does). The stack size in Gramine can be configured via the sys.stack.size manifest option.
Related system calls
☑ clone(): must have combination CLONE_VM | CLONE_THREAD
☑ exit()
☒ get_thread_area(): very rarely used by applications
☒ set_thread_area(): very rarely used by applications
☒ prctl(): very rarely used by applications
▣ arch_prctl(): only x86-specific subset of flags
  ☑ ARCH_GET_FS
  ☑ ARCH_SET_FS
  ☒ ARCH_GET_GS
  ☒ ARCH_SET_GS
☒ clone3(): very rarely used by applications
Process and thread identifiers
Gramine supports the following identifiers: Process IDs (PIDs), Parent Process IDs (PPIDs), Thread IDs (TIDs). The corresponding system calls are getpid(), getppid(), gettid(), set_tid_address().
Gramine has dummy support for Process Group IDs (PGIDs): a PGID can only be read or set for the current process. It is impossible to get/set PGIDs of other (e.g. child) processes. The corresponding system calls are getpgid(), getpgrp(), setpgid().
Gramine virtualizes process/thread identifiers. In other words, in-Gramine PIDs and TIDs have no correlation with host-OS PIDs and TIDs. Each Gramine instance starts a main process with PID 1.
Gramine implements a subset of pseudo-files under /proc/[pid]: more pseudo-files for the current process (aka /proc/self) and its threads, fewer pseudo-files for remote processes (e.g. children), and no pseudo-files for remote threads. See the list under “Related pseudo-files”.
Related system calls
☑ getpid()
☑ getppid()
☑ gettid()
☑ set_tid_address()
▣ getpgid(): dummy, see above
▣ setpgid(): dummy, see above
▣ getpgrp(): dummy, see above
Related pseudo-files
▣ /proc/[this-pid]/ (aka /proc/self/): only most important files implemented
  ☑ /proc/[this-pid]/cmdline
  ☑ /proc/[this-pid]/cwd
  ☑ /proc/[this-pid]/exe
  ☑ /proc/[this-pid]/fd
  ☑ /proc/[this-pid]/maps
  ☑ /proc/[this-pid]/root
  ▣ /proc/[this-pid]/stat: partially implemented
    ☑ pid, comm, ppid, pgrp, num_threads, vsize, rss
    ▣ state: always indicates “R” (running)
    ▣ flags: indicates only PF_RANDOMIZE
    ☒ rest fields: always zero
  ▣ /proc/[this-pid]/statm: partially implemented
    ☑ size/VmSize, resident/VmRSS
    ☒ rest fields: always zero
  ▣ /proc/[this-pid]/status: partially implemented
    ☑ VmPeak
    ☒ rest fields: not printed
  ☑ /proc/[this-pid]/task
▣ /proc/[remote-pid]/: minimally implemented
  ☑ /proc/[remote-pid]/cwd
  ☑ /proc/[remote-pid]/exe
  ☑ /proc/[remote-pid]/root
▣ /proc/[local-tid]/: same as /proc/[this-pid]
☒ /proc/[remote-tid]/: very rarely used by applications
☑ /proc/sys/kernel/pid_max
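To make the /proc/[this-pid]/stat entry above concrete, here is a small sketch of how an application might read exactly the fields that the list marks as implemented. The function name parse_proc_stat is hypothetical; field positions follow the proc(5) layout (pid is field 1, comm field 2, state field 3, and so on), and the sample line in the usage note is made up for illustration.

```python
def parse_proc_stat(line):
    """Parse selected fields of a /proc/[pid]/stat line into a dict."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split around the first "(" and the last ")".
    left, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()  # fields[0] is field 3 (state) in proc(5) numbering
    return {
        "pid": int(left),
        "comm": comm,
        "state": fields[0],
        "ppid": int(fields[1]),
        "pgrp": int(fields[2]),
        "flags": int(fields[6]),
        "num_threads": int(fields[17]),
        "vsize": int(fields[20]),
        "rss": int(fields[21]),
    }
```

For example, parsing the made-up line `1 (my (app)) R 0 1 0 0 0 4194304 0 0 0 0 0 0 0 0 0 0 3 0 0 10485760 256` yields pid 1, comm `my (app)`, state `R` and 3 threads. Code that depends on any other field should expect zeros there, per the list above.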
Scheduling
Gramine does not perform scheduling of threads, instead it relies on the host OS to perform scheduling. In case of SGX backend, trying to perform or control scheduling would be futile because SGX threat model has no means of control or verification of scheduling decisions of the host OS.
Gramine fully implements only a few scheduling system calls: sched_yield(), sched_getaffinity(), sched_setaffinity(). Most other scheduling system calls in Gramine have dummy implementations: they return some default sensible values and they do not send requests to the host OS. Finally, sched_getattr() and sched_setattr() are not implemented in Gramine, as applications very rarely use them. In other words, applications running in Gramine cannot set scheduling policy or thread priorities, and they cannot learn currently-used policy and priorities of the host OS. See the list under “Related system calls”.
These dummy implementations serve Gramine well. We have not yet encountered applications that would significantly benefit from scheduling system calls being properly implemented in Gramine.
To support CPU affinity masks and expose NUMA/CPU topology, Gramine implements /sys/devices/system/cpu/ and /sys/devices/system/node/ pseudo-files. See the list in the “System information and resource accounting” section.
Related system calls
☑ sched_yield()
☑ sched_getaffinity()
☑ sched_setaffinity()
▣ getcpu(): dummy, returns a random allowed CPU
▣ getpriority(): dummy, returns default value
▣ setpriority(): dummy, does nothing
▣ sched_getparam(): dummy, returns default values
▣ sched_setparam(): dummy, does nothing
▣ sched_getscheduler(): dummy, returns default value
▣ sched_setscheduler(): dummy, does nothing
▣ sched_get_priority_max(): dummy, returns default value
▣ sched_get_priority_min(): dummy, returns default value
▣ sched_rr_get_interval(): dummy, returns default value
☒ sched_getattr(): very rarely used by applications
☒ sched_setattr(): very rarely used by applications
☒ ioprio_get(): very rarely used by applications
☒ ioprio_set(): very rarely used by applications
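Since the affinity calls are the scheduling calls listed as fully implemented, an application that wants to size a thread pool or pin itself can rely on them rather than on priorities or policies. The sketch below uses Python's standard os.sched_getaffinity() and os.sched_setaffinity() wrappers; the helper names usable_cpu_count and pin_to_one_cpu are illustrative only.

```python
import os


def usable_cpu_count():
    """Number of CPUs the caller may run on, according to its affinity mask."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))  # 0 selects the caller itself
    return os.cpu_count() or 1  # portable fallback on non-Linux platforms


def pin_to_one_cpu():
    """Restrict the caller to one of its currently allowed CPUs; return it."""
    cpu = min(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpu})
    return cpu
```

Sizing worker pools from the affinity mask instead of the total CPU count keeps the application consistent with whatever CPU set the environment exposes.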
Memory synchronization (futexes)
Gramine partially implements futexes.
Current implementation is limited to one process, i.e., threads calling the futex() system call on the same futex word must reside in the same process. Gramine does not support non-private futexes, thus Gramine always assumes the FUTEX_PRIVATE_FLAG flag. We have not yet encountered applications that would require inter-process futexes.
Gramine ignores the FUTEX_CLOCK_REALTIME flag.
Gramine supports the following futex operations: FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_WAKE_OP, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE. Priority-inheritance (PI) futexes and operations on them are not supported.
Gramine implements getting/setting the list of robust futexes, via get_robust_list() and set_robust_list() system calls.
Related system calls
▣ futex(): see notes above
☑ get_robust_list()
☑ set_robust_list()
☒ futex_waitv(): very rarely used by applications
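The futex notes above can be summarized as a small decision table over the `op` argument of futex(). The sketch below is only a model of the documented behavior, not Gramine's implementation: the constants are the values from the Linux header linux/futex.h, while describe_futex_op and the SUPPORTED_CMDS set are hypothetical names introduced for illustration.

```python
# Command and flag values as defined in the Linux UAPI header <linux/futex.h>.
FUTEX_WAIT, FUTEX_WAKE = 0, 1
FUTEX_REQUEUE, FUTEX_CMP_REQUEUE, FUTEX_WAKE_OP = 3, 4, 5
FUTEX_LOCK_PI = 6  # one of the priority-inheritance commands
FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET = 9, 10
FUTEX_PRIVATE_FLAG = 128
FUTEX_CLOCK_REALTIME = 256

# Operations this document lists as supported.
SUPPORTED_CMDS = {
    FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAKE, FUTEX_WAKE_BITSET,
    FUTEX_WAKE_OP, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE,
}


def describe_futex_op(op):
    """Split a futex `op` into its command and flags, per the notes above."""
    cmd = op & ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
    return {
        "cmd": cmd,
        "supported": cmd in SUPPORTED_CMDS,
        # A missing private flag asks for a shared futex, which is treated
        # as private anyway according to the text above.
        "shared_requested": not (op & FUTEX_PRIVATE_FLAG),
        "clock_realtime_ignored": bool(op & FUTEX_CLOCK_REALTIME),
    }
```

For instance, `describe_futex_op(FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)` reports an unsupported command, which matches the statement that PI futexes are not available.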
Memory management
Gramine implements memory-management system calls: mmap(), mprotect(), munmap(), brk(), etc. Some exotic flags and features are not implemented, but we didn’t observe any applications that would fail or behave incorrectly because of that.
mmap() supports anonymous (MAP_ANONYMOUS) and file-backed (MAP_FILE) mappings. All commonly used flags like MAP_SHARED, MAP_PRIVATE, MAP_FIXED, MAP_FIXED_NOREPLACE, MAP_STACK, MAP_GROWSDOWN, MAP_32BIT are supported.
In case of SGX backend, MAP_SHARED flag is ignored for anonymous mappings, and for file-backed mappings, it depends on the type of file:
disallowed for trusted files (these files are read-only, thus the flag is meaningless),
disallowed for allowed files (for security reasons: it would be easy to abuse it),
allowed for encrypted files (but synchronization happens only on explicit system calls like msync() and close()).
MAP_NORESERVE’s original semantics are not implemented and it is silently ignored. However, in case of SGX backend and on systems supporting EDMM, MAP_NORESERVE flag is used as a lazy-allocation heuristic/hint for anonymous mappings – instead of pre-accepting the region of enclave pages on mmap requests, the enclave pages are lazily accepted on page-fault events.
MAP_LOCKED, MAP_POPULATE, MAP_NONBLOCK, MAP_HUGETLB, MAP_HUGE_2MB, MAP_HUGE_1GB flags are ignored (allowed but have no effect). MAP_SYNC flag is not supported.
mprotect() supports all flags except PROT_SEM and PROT_GROWSUP. We haven’t encountered any applications that would use these flags. In case of SGX backend, mprotect() behavior differs:
on systems supporting EDMM, mprotect() correctly applies permissions;
on systems not supporting EDMM, all enclave memory is allocated with Read-Write-Execute permissions, and mprotect() calls are silently ignored.
madvise() implements only a minimal subset of functionality:
MADV_DONTNEED is partially supported:
  resetting writable file-backed mappings is not implemented;
  all other cases are implemented.
MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL, MADV_WILLNEED, MADV_FREE, MADV_SOFT_OFFLINE, MADV_MERGEABLE, MADV_UNMERGEABLE, MADV_HUGEPAGE, MADV_NOHUGEPAGE are ignored (allowed but have no effect).
All other advice values are not supported.
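As an illustration of the MADV_DONTNEED case, the sketch below exercises the Linux behavior for a private anonymous mapping: after the advice, the page reads back as zeros. It uses Python's standard mmap module (madvise() requires Python 3.8 or newer); the helper name is hypothetical. That this particular case falls under "all other cases are implemented" is our reading of the list above, not a tested statement about Gramine.

```python
import mmap


def dontneed_resets_anonymous_page():
    """Write to a private anonymous page, drop it, and return what is read back."""
    m = mmap.mmap(
        -1,
        mmap.PAGESIZE,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    try:
        m[:5] = b"hello"
        # On Linux, the next access to a dropped private anonymous page
        # is served from a fresh zero-filled page.
        m.madvise(mmap.MADV_DONTNEED)
        return m[:5]
    finally:
        m.close()
```

Allocators that use MADV_DONTNEED to return memory rely on exactly this zero-fill guarantee, which is why the advice cannot simply be ignored the way the hint-only values are.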
Gramine does not support anonymous files (created via memfd_create()).
Quick summary of other memory-management system calls:
munmap() has nothing of note;
mremap() is not implemented (very rarely used by applications);
msync() implements only MS_SYNC and MS_ASYNC (MS_INVALIDATE is not implemented);
mbind() is a no-op;
mincore() always tells that pages are not in RAM;
set_mempolicy() and get_mempolicy() are not implemented;
mlock(), munlock(), mlockall(), munlockall(), mlock2() are dummy (always return success).
As can be seen from above, many performance-improving system calls, flags and features are currently not implemented by Gramine. Keep it in mind when you observe application performance degradation.
Related system calls
☑ brk()
▣ mmap(): see above for notes
▣ mprotect(): see above for notes
☑ munmap()
▣ msync(): does not implement MS_INVALIDATE
▣ madvise(): see above for notes
▣ mbind(): dummy
▣ mincore(): dummy
▣ mlock(): dummy
▣ munlock(): dummy
▣ mlockall(): dummy
▣ munlockall(): dummy
▣ mlock2(): dummy
☒ mremap(): very rarely used by applications
☒ remap_file_pages(): very rarely used by applications
☒ set_mempolicy(): may be implemented in the future
☒ get_mempolicy(): may be implemented in the future
☒ memfd_create(): may be implemented in the future
☒ memfd_secret(): very rarely used by applications
☒ membarrier(): may be implemented in the future
☒ move_pages(): very rarely used by applications
☒ migrate_pages(): very rarely used by applications
☒ process_madvise(): very rarely used by applications
☒ process_mrelease(): very rarely used by applications
☒ set_mempolicy_home_node(): very rarely used by applications
Overview of Inter-Process Communication (IPC)
Gramine implements most of the Linux IPC mechanisms. In particular:
☑ Signals and process state changes
☑ Pipes
☑ FIFOs (named pipes)
▣ UNIX domain sockets
▣ File locking
▣ Shared memory (untrusted, POSIX only)
▣ Semaphores (untrusted, POSIX only)
☒ Message queues
Gramine implements pipes, FIFOs and UNIX domain sockets (UDSes) via host-OS pipes. In case of SGX backend, all pipe, FIFO and UDS communication is transparently encrypted.
For all other IPC mechanisms – currently these are signals, process state changes, file locks – Gramine emulates them via internal message passing (in case of SGX, all messages are encrypted).
Thus, Gramine implements all IPC primitives using a single host-OS primitive: pipes. This design choice means that Gramine is a distributed Library OS, in contrast to the Linux kernel which is monolithic. Each Gramine process knows only about its own state and must query peer Gramine processes to learn their state; compare it to the Linux kernel which keeps a single state for all processes running on top of it. Thus, all IPC in Gramine is performed using message passing over host-OS pipes. To govern this message passing, the first Gramine process is designated a leader which controls all message requests/responses among processes in one Gramine instance. For example, if one Gramine process spawns a new child, it requests the leader to assign a PID for this child. As another example, all POSIX-locking operations are synchronized using a special messaging protocol that is managed by the leader.
Because of this Gramine peculiarity, IPC-intensive applications may experience performance degradation. Also, some IPC-related system calls and pseudo-files are not implemented in Gramine due to the complexity of message-passing implementation.
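The leader-based design can be illustrated with a toy model in which a process asks the leader for a new PID over a pair of pipes. This is purely a sketch of the idea: the message layout, the Leader class and the request_pid helper are invented for illustration and do not reflect Gramine's actual protocol, which in addition runs between separate processes and is encrypted on SGX.

```python
import os
import struct

REQUEST = struct.Struct("!I")   # hypothetical request: PID of the requester
RESPONSE = struct.Struct("!I")  # hypothetical response: newly assigned PID


class Leader:
    """Toy leader that owns the PID counter for one instance."""

    def __init__(self):
        self.next_pid = 2   # PID 1 is the first process, i.e. the leader
        self.parents = {}   # assigned PID -> PID of the requesting parent

    def serve_one(self, request_fd, response_fd):
        """Read one request from a pipe and answer it on another pipe."""
        (parent,) = REQUEST.unpack(os.read(request_fd, REQUEST.size))
        pid = self.next_pid
        self.next_pid += 1
        self.parents[pid] = parent
        os.write(response_fd, RESPONSE.pack(pid))


def request_pid(leader, parent_pid):
    """Ask the leader for a child PID, using pipes as the only channel."""
    to_leader_r, to_leader_w = os.pipe()
    from_leader_r, from_leader_w = os.pipe()
    try:
        os.write(to_leader_w, REQUEST.pack(parent_pid))   # process -> leader
        leader.serve_one(to_leader_r, from_leader_w)      # leader handles it
        (pid,) = RESPONSE.unpack(os.read(from_leader_r, RESPONSE.size))
        return pid
    finally:
        for fd in (to_leader_r, to_leader_w, from_leader_r, from_leader_w):
            os.close(fd)
```

Even in this toy form, every PID assignment costs a round trip through pipes, which hints at why IPC-heavy workloads pay a price under this design.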
Gramine implements limited support for POSIX shared memory (but not for System V shared memory). Please note that in case of the SGX backend, implementation of shared memory is insecure. For more information, please refer to the corresponding manifest syntax and the corresponding section in this document.
Since Gramine has support for POSIX shared memory, consequently Gramine has support for POSIX semaphores (which are built on top of POSIX shared memory). In case of the SGX backend, implementation of POSIX semaphores is insecure, similarly to POSIX shared memory. Please refer to the corresponding section in this document.
To learn more about Gramine support for each of the Linux IPC mechanisms, refer to corresponding sections below.
Additional materials
For Linux IPC overview, we recommend reading Beej’s Guide to Unix IPC.
In case of SGX backend, pipes, FIFOs, UDSes and all other IPC communication are encrypted using the TLS-PSK (TLS with Pre-Shared Keys) protocol. The pre-shared key is randomly generated for each new Gramine instance. Before establishing any pipe/IPC communication, two Gramine processes (e.g., parent and child) verify each other’s trustworthiness using SGX local attestation.
Signals and process state changes
Gramine partially implements signals (see below for some limitations). For local signals (Gramine process signals itself, e.g. SIGABRT) and signals from the host OS (e.g. host sends SIGTERM), message passing is not involved. For process-to-process signals (e.g. child process sends SIGCHLD to the parent), message passing is used.
Gramine supports both standard signals and POSIX real-time signals. Queueing and delivery semantics are the same as in Linux. Per-thread signal masks are supported. Restart of system calls after signal handling (if flag SA_RESTART was specified) is supported.
Gramine implements signal dispositions, but some rarely used features are not implemented:
core dump files are never produced,
SA_NOCLDSTOP and SA_NOCLDWAIT signal-behavior flags are ignored,
only fields si_signo, si_code, si_pid, si_uid, si_status, si_addr in the data type siginfo_t are populated.
Gramine supports injecting a single SIGTERM signal from the host. No other signals from the host are supported. By default, Gramine ignores all signals sent by the host (including signals sent from other applications or from other Gramine instances). This limitation is for security reasons, relevant on SGX backend.
Gramine has some limitations on sending signals to processes and threads:
sending a signal to a process group is not supported (e.g. kill(0, sig) sends the signal only to the current process but not to other processes),
tkill() system call cannot send signals to threads in other processes.
Gramine supports waiting for signals (via pause(), rt_sigsuspend(), etc. system calls).
Gramine supports waiting for processes via wait4() and waitid() system calls. However, WSTOPPED and WCONTINUED options are not supported (we didn’t encounter applications that rely on these options). Zombie processes are supported, though the “zombie” state is not reported in /proc/[pid]/stat pseudo-file.
Gramine does not currently support file descriptors for signals (via signalfd()). Also, since Gramine does not currently support pidfd, sending a signal via pidfd_send_signal() is not implemented. Gramine also does not support file descriptors for handling page faults (via userfaultfd()).
Gramine has limited support for pseudo-files that describe the state of remote processes/threads (files under /proc/[remote-pid]/ and /proc/[remote-tid]/). For details, refer to “Related pseudo-files” in the “Process and thread identifiers” section.
Related system calls
☑ pause()
☑ rt_sigaction()
☑ rt_sigpending()
☑ rt_sigprocmask()
☑ rt_sigreturn()
☑ rt_sigsuspend()
☑ rt_sigtimedwait()
☑ sigaltstack()
☒ rt_sigqueueinfo(): very rarely used by applications
☒ rt_tgsigqueueinfo(): very rarely used by applications
☒ signalfd(): very rarely used by applications
☒ signalfd4(): very rarely used by applications
☒ pidfd_open(): very rarely used by applications
☒ pidfd_getfd(): very rarely used by applications
☒ pidfd_send_signal(): very rarely used by applications
☒ process_madvise(): very rarely used by applications
☒ process_mrelease(): very rarely used by applications
☒ userfaultfd(): very rarely used by applications
▣ kill(): process groups not supported
▣ tkill(): remote threads not supported
☑ tgkill()
▣ wait4(): WSTOPPED and WCONTINUED not supported
▣ waitid(): WSTOPPED and WCONTINUED not supported
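A local signal round trip that stays within the calls marked as implemented above can be sketched with Python's standard signal module. The helper name block_send_and_collect is illustrative. On Linux these wrappers are expected to end up in rt_sigprocmask(), tgkill(), rt_sigpending() and rt_sigtimedwait() (or a libc equivalent), but that mapping is an assumption about the C library, not something this document specifies.

```python
import signal
import threading


def block_send_and_collect(sig=signal.SIGUSR1):
    """Block `sig`, send it to the calling thread, observe it pending, consume it."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    try:
        # Local signal: the thread signals itself, so no other process is involved.
        signal.pthread_kill(threading.get_ident(), sig)
        was_pending = sig in signal.sigpending()
        # sigwait() dequeues the blocked signal without running any handler.
        received = signal.sigwait({sig})
        return was_pending, received
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Because the signal is blocked before it is sent, it is never delivered asynchronously; the per-thread mask and the pending queue are the only state involved.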
User and group identifiers
Gramine has dummy support for the following identifiers:
Real user ID (UID) and Real group ID (GID),
Effective user ID (EUID) and Effective group ID (EGID),
Saved set-user-ID (SUID) and Saved set-group-ID (SGID).
The corresponding system calls are:
getuid(), getgid(), setuid(), setgid() for UID and GID (implemented);
geteuid(), getegid() for EUID and EGID (implemented);
setreuid(), setregid() for UID + EUID and GID + EGID (not implemented);
getresuid(), setresuid(), getresgid(), setresgid() for UID + EUID + SUID and GID + EGID + SGID (not implemented).
Gramine starts the application with UID = EUID = SUID, all equal to the loader.uid manifest option. Similarly, the application is started with GID = EGID = SGID, all equal to loader.gid. If these manifest options are not set, then all IDs are equal to zero, which means root user.
During execution, the application may modify these IDs, and the changes will be visible inside the Gramine environment.
Gramine does not support Filesystem user ID (FSUID) and filesystem group ID (FSGID). The corresponding system calls are setfsuid() and setfsgid() (not implemented).
Gramine has dummy support for Supplementary group IDs. The corresponding system calls are getgroups() and setgroups(). Gramine starts the applications with an empty set of supplementary groups. The application may modify this set, and the changes will be visible inside the Gramine environment.
Currently, there are only two usages of user/group IDs in Gramine:
changing ownership of a file via chown() and similar system calls;
passing user ID in the SIGCHLD signal information on child process termination (in siginfo_t::si_uid).
Gramine does not currently implement user/group ID fields in the /proc/[pid]/status pseudo-file.
Related system calls
▣ getuid(): dummy
▣ getgid(): dummy
▣ setuid(): dummy
▣ setgid(): dummy
▣ geteuid(): dummy
▣ getegid(): dummy
▣ getgroups(): dummy
▣ setgroups(): dummy
☒ setreuid(): very rarely used by applications, may be implemented in the future
☒ setregid(): very rarely used by applications, may be implemented in the future
☒ getresuid(): very rarely used by applications, may be implemented in the future
☒ setresuid(): very rarely used by applications, may be implemented in the future
☒ getresgid(): very rarely used by applications, may be implemented in the future
☒ setresgid(): very rarely used by applications, may be implemented in the future
☒ setfsuid(): very rarely used by applications
☒ setfsgid(): very rarely used by applications
Related pseudo-files
☒ /proc/[this-pid]/status: fields Uid, Gid, Groups are not implemented
File systems
Gramine implements filesystem operations, but with several peculiarities and limitations.
The most important peculiarity is that Gramine does not simply mirror the host OS’s directory hierarchy. Instead, Gramine constructs its own view of a selected subset of the host’s directories and files: this is controlled by the manifest’s FS mount points (fs.mounts). This feature is similar to the volumes concept in Docker, and it is introduced for security.
Another peculiarity is that Gramine provides several types of filesystem mounts:
passthrough mounts (contain unencrypted files, see below),
encrypted mounts (contain files that are automatically encrypted and integrity-protected).
In case of SGX backend, passthrough mounts must be of one of two kinds:
containing allowed files (neither encrypted nor cryptographically hashed),
containing trusted files (cryptographically hashed – effectively, their contents are mixed into MRENCLAVE on SGX).
Additionally, mounts may be hosted in one of two ways:
on the host OS (in passthrough mounts),
inside the Gramine process (in tmpfs mounts).
All files potentially used by the application must be specified in the manifest file. Instead of single files, whole directories can be specified. Refer to the manifest documentation for more details.
Gramine also provides a subset of the pseudo-files that can be found in the Linux kernel. In particular, Gramine automatically populates the /proc, /dev and /sys pseudo-filesystems with the most widely used pseudo-files. These pseudo-files cannot be deleted. The complete list can be found in the “List of pseudo-files” section.
The final peculiarity is that Gramine is a distributed Library OS, as discussed in “Overview of Inter-Process Communication (IPC)” section. This means that each Gramine process knows only about its own FS state at any point in time, and must consult the host OS and/or other Gramine processes to learn about any updates. Synchronizing the FS state is a difficult task, and Gramine has only limited support for file sync. For example, two Gramine processes may want to append data to the same file, but Gramine currently does not synchronize such accesses, thus the file contents will be incorrectly overwritten.
Internally, FS implementation in Gramine follows the one in Linux kernel. Gramine implements a Virtual File System (VFS), a uniform interface for various mount types. Gramine also has the concepts of dentries (cached directory/file names for fast lookup) and inodes (metadata about files).
Gramine does not implement full filesystem stack by design. Gramine relies on the host filesystem for most operations. The only exceptions are the tmpfs filesystem and the pseudo-filesystems (implemented entirely inside Gramine).
General FS limitations in Gramine include:
no support for dynamic mounting: all mounts must be specified beforehand in the manifest;
no operations across mounts, e.g., no rename of file located in one mount to another one (note that Linux also doesn’t support such operations);
no synchronization of file offsets, file sizes, etc. between Gramine processes;
tmpfs mounts (in-memory filesystems) are not shared by Gramine processes;
file timestamps (access, modified, change timestamps) are not set/updated.
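Because renames across mounts are refused, an application that moves files between two mount points needs a copy-based fallback. The sketch below assumes the failed rename surfaces as EXDEV, as it does on Linux; `move_file` and `demo` are illustrative names.

```python
import errno
import os
import shutil
import tempfile


def move_file(src, dst):
    """Rename src to dst, copying instead if they live on different mounts."""
    try:
        os.rename(src, dst)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Cross-mount rename was refused: copy the data, then drop the source.
        shutil.copyfile(src, dst)
        os.unlink(src)


def demo():
    # Same-mount case: the plain rename path is taken.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "a.txt")
        dst = os.path.join(tmp, "b.txt")
        with open(src, "w") as f:
            f.write("data")
        move_file(src, dst)
        with open(dst) as f:
            return os.path.exists(src), f.read()
```

Note that the fallback is not atomic, unlike a rename within one mount.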
Additional materials
A mechanism for FS synchronization, as well as a general redesign of certain FS components, is a task Gramine will tackle in the future. Below are some discussions and RFCs:
File system operations
Gramine implements all classic filesystem operations, but with limitations described below.
Gramine supports opening files and directories (via open() and openat() system calls). The O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOFOLLOW, O_PATH, O_TRUNC flags are supported. Other flags are ignored. Notable ignored flags are O_APPEND (not yet implemented in Gramine) and O_TMPFILE (bug in Gramine: should not be silently ignored).
Trusted files can be opened only for reading. Already-existing encrypted files can be opened only if they were not moved or renamed on the host (this is for protection against file renaming attacks).
Gramine supports creating files and directories (via creat(), mkdir(), mkdirat() system calls), reading directories (via getdents()), deleting files and directories (via unlink(), unlinkat(), rmdir()), renaming files and directories (via rename() and renameat()).
Gramine supports read and write operations on files. Appending to files is currently unsupported. Writing to trusted files is prohibited.
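Since O_APPEND is ignored, one possible workaround is to seek to the end of the file explicitly before writing. This is only a sketch for a single writer: unlike a real O_APPEND, the seek and the write are two separate steps and are not atomic. The names `append_bytes` and `demo` are illustrative.

```python
import os
import tempfile


def append_bytes(path, data):
    """Append without relying on O_APPEND: seek to the end, then write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.lseek(fd, 0, os.SEEK_END)
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log.txt")
        append_bytes(path, b"first\n")
        append_bytes(path, b"second\n")
        with open(path, "rb") as f:
            return f.read()
```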
Gramine supports seek operations on files (lseek()). However, seek operation happens entirely inside Gramine (by changing the file offset), and thus may behave incorrectly on host’s device files (which may reimplement the seek operation in a special way).
Gramine supports mmap and msync operations on files. For more information, see the “Memory management” section.
Gramine has dummy support for polling on files via poll(), ppoll(), select() system calls. Regular files always return events “there is data to read” and “writing is possible”. Other files return an error code.
Gramine does not support epoll on files.
Gramine supports file flushes (via fsync() and fdatasync()). However, flushing filesystem metadata (sync() and syncfs()) is not supported. Similarly, sync_file_range() system call is currently not supported.
Gramine supports file truncation (via truncate() and ftruncate()).
Gramine has very limited support of fallocate() system call. Only mode 0 is supported (“allocating disk space”). The emulation of this mode simply extends the file size if applicable, otherwise does nothing. In other words, this system call doesn’t provide reliability or performance guarantees.
Gramine has dummy support of fadvise64() system call. The emulation does nothing and always returns success. In other words, this system call doesn’t provide any performance improvement.
Gramine has support for file mode bits. The chmod(), fchmodat(), fchmod() system calls correctly set the file mode. The umask() system call is also supported.
Gramine has dummy support for file owner and group manipulations. In Gramine, users and groups are dummy; see the “User and group identifiers” section for details. Therefore, chown(), fchownat(), fchown() system calls update UID and GID inside the Gramine environment, but not on host files.
Gramine supports checking permissions on the file via access() and faccessat() system calls. Recall however that users and groups are dummy in Gramine, thus the checks are also largely irrelevant.
Gramine implements sendfile() system call. However, this system call is emulated in an inefficient way (for simplicity), especially in multi-threaded cases. Pay attention to this if your application relies heavily on sendfile().
Gramine supports directory operations: chdir() and fchdir() to change the working directory, and getcwd() to get the current working directory.
Gramine partially supports getting file status (information about files), via stat(), lstat(), fstat(), newfstatat() system calls. The only fields populated in the output buffer are st_mode, st_size, st_uid, st_gid, st_blksize (with hard-coded value), st_nlink (with hard-coded value), st_dev, st_ino. Note that Gramine currently doesn’t support links, so lstat() always resolves to a file (never to a symlink).
Gramine has dummy support for getting filesystem statistics via statfs() and fstatfs(). The only fields populated in the output buffer are f_bsize, f_blocks, f_bfree and f_bavail, and they all have hard-coded values.
Gramine currently does not support changing file access/modification times, via utime(), utimes(), futimesat(), utimensat() system calls.
Mounting files and directories with extended attributes (xattr) or setting them via setxattr(), lsetxattr(), fsetxattr(), removexattr(), lremovexattr(), fremovexattr() is not supported. Reading is supported (getxattr(), lgetxattr(), fgetxattr(), listxattr(), llistxattr(), flistxattr()) but always returns no attributes (which is a correct result in our case).
Related system calls
▣ open(): implemented, with limitations
▣ openat(): implemented, with limitations
☑ close()
☑ close_range()
☑ creat()
☑ mkdir()
☑ mkdirat()
☑ getdents()
☑ getdents64()
☑ unlink()
☑ unlinkat()
☑ rmdir()
▣ rename(): cannot rename across mounts
▣ renameat(): cannot rename across mounts
☑ read()
☑ pread64()
☑ readv()
☑ preadv()
☑ write()
☑ pwrite64()
☑ writev()
☑ pwritev()
▣ lseek(): see note above
▣ mmap(): see notes above
▣ msync(): see notes above
▣ select(): dummy
▣ pselect6(): dummy
▣ poll(): dummy
▣ ppoll(): dummy
☑ fsync()
☑ fdatasync()
☑ truncate()
☑ ftruncate()
▣ fallocate(): dummy
▣ fadvise64(): dummy
☑ chmod()
☑ fchmod()
☑ fchmodat()
▣ chown(): dummy
▣ fchown(): dummy
▣ fchownat(): dummy
▣ access(): dummy
▣ faccessat(): dummy
☑ umask()
▣ sendfile(): unoptimized
☑ chdir()
☑ fchdir()
☑ getcwd()
▣ stat(): partially dummy
▣ fstat(): partially dummy
▣ lstat(): partially dummy, always resolves to actual file
▣ newfstatat(): partially dummy
▣ statfs(): partially dummy
▣ fstatfs(): partially dummy
☑ chroot()
☒ name_to_handle_at(): very rarely used by applications
☒ open_by_handle_at(): very rarely used by applications
☒ openat2(): very rarely used by applications
☒ renameat2(): very rarely used by applications
☒ preadv2(): very rarely used by applications
☒ pwritev2(): very rarely used by applications
☒ epoll_create(): very rarely used by applications
☒ epoll_create1(): very rarely used by applications
☒ epoll_wait(): very rarely used by applications
☒ epoll_pwait(): very rarely used by applications
☒ epoll_pwait2(): very rarely used by applications
☒ epoll_ctl(): very rarely used by applications
☒ sync(): very rarely used by applications
☒ syncfs(): very rarely used by applications
☒ sync_file_range(): very rarely used by applications
☒ faccessat2(): very rarely used by applications
☒ statx(): very rarely used by applications
☒ sysfs(): very rarely used by applications
☒ ustat(): very rarely used by applications
☒ mount(): very rarely used by applications
☒ move_mount(): very rarely used by applications
☒ umount2(): very rarely used by applications
☒ mount_setattr(): very rarely used by applications
☒ pivot_root(): very rarely used by applications
☒ utime(): may be implemented in the future
☒ utimes(): may be implemented in the future
☒ futimesat(): may be implemented in the future
☒ utimensat(): may be implemented in the future
☑ getxattr()
☑ lgetxattr()
☑ fgetxattr()
☑ listxattr()
☑ llistxattr()
☑ flistxattr()
☒ removexattr()
☒ lremovexattr()
☒ fremovexattr()
☒ setxattr()
☒ lsetxattr()
☒ fsetxattr()
File locking
File locking operations can be considered one of the IPC mechanisms, as discussed in “Overview of Inter-Process Communication (IPC)” section. Thus, file locks are implemented via message passing in Gramine, and all lock-requests are handled in the main (leader) process.
Gramine currently implements two types of file locks:
POSIX (fcntl) locks aka Advisory record locks. In particular, the following operations are implemented: fcntl(F_SETLK), fcntl(F_SETLKW) and fcntl(F_GETLK).
BSD (flock) locks. The following system call is implemented: flock(). Its support is currently experimental and not suitable for production.
Both types of file locks share the same internal implementation in Gramine. The current implementation has the following caveats:
Lock requests from other processes will always have the overhead of IPC round-trip, even if the lock is uncontested.
The main process has to be able to look up the same file, so locking will not work for files in local-process-only filesystems (e.g. tmpfs).
There is no deadlock detection (EDEADLK). This is only applicable to POSIX locks; BSD locks do not have deadlock detection in the first place.
The lock requests cannot be interrupted (EINTR).
The locks work only on regular files (no pipes, sockets etc.).
Similarly to Linux, BSD (flock) locks ignore deprecated LOCK_{MAND,READ,WRITE,RW} operations.
BSD (flock) locks are currently experimental and are disabled by default. To enable them, use the sys.experimental__enable_flock manifest option. There is at least one problem with BSD locks currently: they are supposed to be released when the last reference (file descriptor, or FD) to the underlying opened file is closed, including when a process with the opened file terminates. Unfortunately, Gramine lacks system-wide tracking of opened files’ FDs. This may lead to premature releases of flock locks in some situations.
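The POSIX lock operations described above can be exercised with a short sketch. Python's `fcntl.lockf()` wraps fcntl() record locks; with `LOCK_NB` it issues a non-blocking F_SETLK request. The helper names are illustrative. Under Gramine, the file would additionally need to live on a mount that the main process can look up (so not tmpfs), per the caveats above.

```python
import fcntl
import tempfile


def try_lock_exclusive(fd):
    """Try to take a whole-file POSIX write lock without blocking (F_SETLK)."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        # Another process holds a conflicting lock (EACCES or EAGAIN).
        return False


def unlock(fd):
    """Release the lock taken by try_lock_exclusive()."""
    fcntl.lockf(fd, fcntl.LOCK_UN)


def demo():
    # The temporary file is opened read-write, as a write lock requires.
    with tempfile.NamedTemporaryFile() as f:
        got = try_lock_exclusive(f.fileno())
        if got:
            unlock(f.fileno())
        return got
```

Because there is no deadlock detection and blocked requests cannot be interrupted, the non-blocking form with an application-level retry loop may be the safer pattern.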
Related system calls
▣ fcntl()
  ▣ F_SETLK: see notes above
  ▣ F_SETLKW: see notes above
  ▣ F_GETLK: see notes above
▣ flock(): experimental, see notes above
Monitoring filesystem events (inotify, fanotify)
Gramine does not currently implement inotify and fanotify APIs. Gramine could implement them in the future, if need arises.
Related system calls
☒ inotify_init()
☒ inotify_init1()
☒ inotify_add_watch()
☒ inotify_rm_watch()
☒ fanotify_init()
☒ fanotify_mark()
Hard links and soft links (symbolic links)
There are two notions that must be discussed separately:
Host OS’s links: Gramine sees them as normal files. On Linux host, these links are currently always followed during directory/file lookup.
In-Gramine links: Gramine has no support for links (i.e., applications cannot create links). There is one exception: some pseudo-files like /proc/[pid]/cwd and /proc/self.
The above means that Gramine does not implement link() and symlink() system calls. Support for readlink() system call is limited to only pseudo-files’ links mentioned above.
Gramine may implement hard and soft links in the future.
Related system calls
☒ link()
☒ symlink()
▣ readlink(): see note above
☒ linkat()
☒ symlinkat()
▣ readlinkat(): see note above
☒ lchown()
Related pseudo-files
The following pseudo-files are symlinks. See also “Related pseudo-files” in the “Process and thread identifiers” section.
☑ /dev/
  ☑ /dev/stdin
  ☑ /dev/stdout
  ☑ /dev/stderr
☑ /proc/self/
☑ /proc/[pid]/
  ☑ /proc/[pid]/cwd
  ☑ /proc/[pid]/exe
  ☑ /proc/[pid]/root
Pipes and FIFOs (named pipes)
Pipes and FIFOs are emulated in Gramine directly as host-level pipes (to be more specific, as socketpairs for Linux hosts). In case of SGX backend, pipes and FIFOs are transparently encrypted. For additional information on general properties of IPC in Gramine, see the “Overview of Inter-Process Communication (IPC)” section.
Gramine does not allow pipe/FIFO communication between Gramine processes and the host. Gramine also does not allow communication between Gramine processes from two different Gramine instances. Communication on pipes/FIFOs is possible only between two Gramine processes in the same Gramine instance.
Gramine does not allow more than two parties on one pipe/FIFO. For example, it is impossible to implement an SPMC (Single Producer Multiple Consumers) queue using a single pipe/FIFO. (We have not encountered applications that would try to use such patterns though.)
Gramine supports creating pipes (via pipe() and pipe2()) and FIFOs (via mknod(S_IFIFO) and mknodat(S_IFIFO)). The O_DIRECT flag while creating pipes with pipe2() is ignored. Blocking and non-blocking pipes/FIFOs (O_NONBLOCK flag) are supported.
Gramine supports read and write operations on pipes and FIFOs. Gramine supports generation of the SIGPIPE signal on write operation if the read end of a pipe has been closed. Polling on pipes and FIFOs is supported.
Gramine supports getting information about pipes/FIFOs via the fstat() and newfstatat() system calls. The only fields populated in the output buffer are st_uid, st_gid and st_mode. Gramine also supports getting the number of unread bytes in the pipe via ioctl(FIONREAD).
Gramine supports getting and setting pipe/FIFO status flags via fcntl(F_GETFL) and fcntl(F_SETFL). The only currently supported flag is O_NONBLOCK; O_ASYNC is not supported. Gramine also supports setting blocking/non-blocking mode via ioctl(FIONBIO).
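The pipe operations above can be combined in a short sketch: switch the read end to non-blocking mode with fcntl(F_GETFL)/fcntl(F_SETFL), then query the number of unread bytes with ioctl(FIONREAD). The function names are illustrative.

```python
import fcntl
import os
import struct
import termios


def set_nonblocking(fd):
    """Add O_NONBLOCK to the descriptor's status flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)


def unread_bytes(fd):
    """Ask how many bytes are waiting in the pipe via ioctl(FIONREAD)."""
    raw = fcntl.ioctl(fd, termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def demo():
    r, w = os.pipe()
    try:
        set_nonblocking(r)
        try:
            os.read(r, 16)
            empty_read_blocked = False
        except BlockingIOError:
            # Nothing to read yet: a non-blocking read fails with EAGAIN.
            empty_read_blocked = True
        os.write(w, b"hello")
        return empty_read_blocked, unread_bytes(r), os.read(r, 16)
    finally:
        os.close(r)
        os.close(w)
```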
Related system calls
☑ pipe()
▣ pipe2(): O_DIRECT flag is ignored
▣ mknod(): S_IFIFO type is supported
▣ mknodat(): S_IFIFO type is supported
☑ close()
☑ fstat()
☑ read()
☑ readv()
☑ write()
☑ writev()
☑ select()
☑ pselect6()
☑ poll()
☑ ppoll()
☑ epoll_create()
☑ epoll_create1()
☑ epoll_wait()
☑ epoll_pwait()
☑ epoll_ctl()
☒ epoll_pwait2(): very rarely used by applications
▣ sendfile(): unoptimized
▣ fcntl()
  ▣ F_GETFL: only O_NONBLOCK
  ▣ F_SETFL: only O_NONBLOCK
  ☒ F_GETPIPE_SZ: very rarely used by applications
  ☒ F_SETPIPE_SZ: very rarely used by applications
▣ ioctl()
  ☑ FIONREAD
  ☑ FIONBIO
Networking (sockets)
Gramine supports the most important networking protocols. In particular, Gramine supports only the following protocol families:
AF_INET (IPv4 Internet protocols, e.g. TCP/IP and UDP/IP),
AF_INET6 (IPv6 Internet protocols, e.g. TCP/IP and UDP/IP),
AF_UNIX aka AF_LOCAL (UNIX domain sockets).
Gramine supports only two types of sockets:
SOCK_STREAM (connection-based byte streams),
SOCK_DGRAM (connectionless datagrams).
Gramine supports TCP/IP sockets and UDP/IP sockets, i.e. the combinations AF_INET/AF_INET6 + SOCK_STREAM and AF_INET/AF_INET6 + SOCK_DGRAM respectively. Gramine supports stream UNIX domain sockets (AF_UNIX + SOCK_STREAM), but does not support datagram UNIX domain sockets (AF_UNIX + SOCK_DGRAM).
Non-blocking sockets (SOCK_NONBLOCK) are supported. Non-blocking connects are supported, i.e., cases when connect() fails with EINPROGRESS are supported.
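A non-blocking connect typically follows the pattern sketched below: start the connect, wait for the socket to become writable, then read SO_ERROR to learn the outcome. The function name `connect_nonblocking` and its timeout parameter are illustrative assumptions.

```python
import errno
import os
import select
import socket


def connect_nonblocking(address, timeout=5.0):
    """Connect a non-blocking TCP socket, handling the EINPROGRESS case."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM | socket.SOCK_NONBLOCK)
    err = sock.connect_ex(address)
    if err not in (0, errno.EINPROGRESS):
        sock.close()
        raise OSError(err, os.strerror(err))
    if err == errno.EINPROGRESS:
        # The connect continues in the background; writability signals completion.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            sock.close()
            raise TimeoutError("connect timed out")
        # SO_ERROR holds the final result of the asynchronous connect.
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err:
            sock.close()
            raise OSError(err, os.strerror(err))
    return sock
```

The returned socket stays in non-blocking mode, so later sends and receives need the same readiness handling.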
Generation of the SIGPIPE signal on send operation if the receive end of a socket has been closed is supported.
Gramine does not implement full network stack by design. Gramine relies on the host network stack for most operations.
Other networking limitations in Gramine include:
no support for auto binding in the listen() system call;
dummy support for ancillary data (aka control messages): received messages always indicate there is no ancillary data attached to them.
TCP/IP and UDP/IP sockets
TCP/IP and UDP/IP sockets (TCP and UDP for short) support all Berkeley sockets APIs, including socket(), bind(), listen(), connect(), accept(), send(), recv(), getsockopt(), setsockopt(), getsockname(), getpeername(), shutdown(), etc. system calls. Polling on TCP and UDP sockets via poll(), ppoll(), select(), epoll_*() system calls is supported.
TCP sockets support only MSG_NOSIGNAL, MSG_DONTWAIT and MSG_MORE flags in send(), sendto(), sendmsg(), sendmmsg() system calls. Note that MSG_MORE flag is ignored. UDP sockets support only MSG_NOSIGNAL and MSG_DONTWAIT flags.
TCP sockets support only MSG_PEEK, MSG_DONTWAIT and MSG_TRUNC flags in recv(), recvfrom(), recvmsg(), recvmmsg() system calls. UDP sockets support only MSG_DONTWAIT and MSG_TRUNC flags.
TCP and UDP sockets support the following socket options:
SO_ACCEPTCONN, SO_DOMAIN, SO_TYPE, SO_PROTOCOL, SO_ERROR (all read-only),
SO_RCVTIMEO, SO_SNDTIMEO, SO_REUSEADDR, SO_REUSEPORT, SO_BROADCAST, SO_KEEPALIVE, SO_LINGER, SO_RCVBUF, SO_SNDBUF, IPV6_V6ONLY, IP_RECVERR, IPV6_RECVERR (allowed but ignored).
TCP sockets additionally support the following socket options: TCP_CORK, TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT, TCP_NODELAY and TCP_USER_TIMEOUT.
Note on domain names configuration
To use libc name-resolving Berkeley socket APIs like gethostbyname(), gethostbyaddr(), getaddrinfo(), one must enable the sys.enable_extra_runtime_domain_names_conf manifest option.
Related system calls
▣ socket(): see notes above
☑ bind()
☑ listen()
☑ accept()
☑ accept4()
☑ connect()
☑ close()
☑ shutdown()
☑ getsockname()
☑ getpeername()
☑ getsockopt()
☑ setsockopt()
☑ fstat()
☑ read()
☑ readv()
☑ write()
☑ writev()
▣ recv(): see supported flags above
▣ recvfrom(): see supported flags above
▣ recvmsg(): see supported flags above
▣ recvmmsg(): see supported flags above
▣ send(): see supported flags above
▣ sendto(): see supported flags above
▣ sendmsg(): see supported flags above
▣ sendmmsg(): see supported flags above
☑ select()
☑ pselect6()
☑ poll()
☑ ppoll()
☑ epoll_create()
☑ epoll_create1()
☑ epoll_wait()
☑ epoll_pwait()
☑ epoll_ctl()
☒ epoll_pwait2(): very rarely used by applications
▣ sendfile(): unoptimized
▣ fcntl()
  ▣ F_GETFL: only O_NONBLOCK
  ▣ F_SETFL: only O_NONBLOCK
▣ ioctl()
  ☑ FIONREAD
  ☑ FIONBIO
Related pseudo-files
☒ /proc/sys/net/core/
☒ /proc/sys/net/ipv4/
☒ /proc/sys/net/ipv6/
UNIX domain sockets
UNIX domain sockets (UDSes) are emulated in Gramine directly as host-level pipes (to be more specific, as socketpairs for Linux hosts). In case of SGX backend, UDSes are transparently encrypted. For additional information on general properties of IPC in Gramine, see the “Overview of Inter-Process Communication (IPC)” section.
Gramine does not allow UDS communication between Gramine processes and the host. Gramine also does not allow communication between Gramine processes from two different Gramine instances. Communication on UDSes is possible only between two Gramine processes in the same Gramine instance. See also the “Pipes and FIFOs (named pipes)” section.
UDSes support all Berkeley sockets APIs, including socket(), bind(), listen(), connect(), accept(), send(), recv(), getsockopt(), setsockopt(), getsockname(), getpeername(), shutdown(), etc. system calls. Polling on UDSes via poll(), ppoll(), select(), epoll_*() system calls is supported.
Named UDSes are currently not visible on the Gramine filesystem (they do not have a corresponding dentry). This may be implemented in the near future; see the note below.
UDSes do not support ancillary data (aka control messages) in sendmsg() and recvmsg() system calls. In particular, the SCM_RIGHTS type is not supported; support for this type may be added in the future.
Gramine does not support connect() system call on an already bound UDS (via bind()).
UDSes support only MSG_NOSIGNAL, MSG_DONTWAIT and MSG_MORE flags in send(), sendto(), sendmsg(), sendmmsg() system calls. Note that MSG_MORE flag is ignored.
UDSes support only MSG_PEEK, MSG_DONTWAIT and MSG_TRUNC flags in recv(), recvfrom(), recvmsg(), recvmmsg() system calls.
UDSes support the following socket options:
SO_ACCEPTCONN, SO_DOMAIN, SO_TYPE, SO_PROTOCOL, SO_ERROR (all read-only),
SO_REUSEADDR (ignored, same as in Linux).
Note on named UDSes
There is an effort to make named UDSes visible on the Gramine filesystem, see https://github.com/gramineproject/gramine/pull/1021.
Related system calls
☑ socketpair()
For other system calls, see “TCP/IP and UDP/IP sockets” subsection above.
Related pseudo-files
☒ /proc/sys/net/unix/
For other pseudo-files, see “TCP/IP and UDP/IP sockets” subsection above.
I/O multiplexing
Gramine implements I/O multiplexing system calls: select(), pselect6(), poll(), ppoll(), as well as the epoll family of system calls (epoll_*()). All these system calls are emulated via the ppoll() Linux-host system call.
Gramine supports I/O multiplexing on pipes, FIFOs, sockets and eventfd. For peculiarities of regular-files support, see the “File system operations” section.
Timeouts and signal masks are honoured. Timeout is updated on return from corresponding system calls.
Edge-triggered and level-triggered events in epoll are supported (the EPOLLET flag). EPOLLONESHOT, EPOLL_NEEDS_REARM flags are supported. EPOLLWAKEUP flag is ignored because Gramine does not implement autosleep.
Select and poll families of system calls are implemented in Gramine.
Poll/ppoll system calls have the following limitation:
POLLRDHUP is always reported together with POLLHUP.
Epoll family of system calls has the following limitations:
No sharing of an epoll instance between processes; updates in one process (e.g. adding an fd to be monitored) won’t be visible in the other process.
EPOLLEXCLUSIVE is a no-op; this is correct semantically, but may reduce performance of apps using this flag.
Adding an epoll to another epoll instance is not currently supported.
EPOLLRDHUP is always reported together with EPOLLHUP.
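To illustrate the edge-triggered mode mentioned above, the sketch below registers the read end of a pipe with EPOLLET inside a single process (epoll instances are not shared between processes). The function name is illustrative, and the behaviour described in the comments is that of Linux epoll.

```python
import os
import select


def edge_triggered_demo():
    """Show that an edge-triggered event is reported once per new arrival."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN | select.EPOLLET)
        os.write(w, b"x")
        # The write is a new edge, so the first wait reports the read end.
        first = ep.poll(1.0)
        # No new data arrived since, so a second wait reports nothing,
        # even though the byte is still unread.
        second = ep.poll(0)
        return [(fd == r, events) for fd, events in first], second
    finally:
        ep.close()
        os.close(r)
        os.close(w)
```

With level-triggered registration (no EPOLLET), the second wait would report the descriptor again until the data is read.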
Related system calls
☑ select()
☑ pselect6()
☑ poll()
☑ ppoll()
▣ epoll_create(): see notes above
▣ epoll_create1(): see notes above
▣ epoll_wait(): see notes above
▣ epoll_pwait(): see notes above
▣ epoll_ctl(): see notes above
☒ epoll_pwait2(): very rarely used by applications
Asynchronous I/O
There are two asynchronous I/O APIs in Linux kernel:
Linux-native asynchronous I/O (Linux AIO, older API with io_setup() etc.),
I/O uring (io_uring, newer API with io_uring_setup() etc.).
Gramine does not currently implement either of these APIs. Gramine could implement them in the future, if need arises.
Note that AIO provided in userspace by glibc (aio_read(), aio_write(), etc.) does not depend on Gramine and is supported.
Related system calls
☒ io_setup()
☒ io_destroy()
☒ io_getevents()
☒ io_submit()
☒ io_cancel()
☒ io_uring_setup()
☒ io_uring_enter()
☒ io_uring_register()
Event notifications (eventfd)
There are two modes of eventfd:
Secure “emulate-in-Gramine” – the eventfd object is created inside Gramine, and all operations are resolved entirely inside Gramine. A dummy eventfd object is created on the host, purely to trigger read/write notifications (e.g., in epoll); eventfd values are verified inside Gramine and are never exposed to the host. Since the host is used purely for notifications, a malicious host can only induce Denial of Service (DoS) attacks; thus this implementation is secure and enabled by default. This implementation is automatically disabled if sys.insecure__allow_eventfd manifest option is enabled.
The emulation is currently implemented at the level of a single process. The emulation may work for multi-process applications, e.g., if the child process inherits the eventfd object but doesn’t use it. However, all eventfds created in the parent process are marked as invalid in child processes, i.e. inter-process communication via eventfds is not allowed.
Note that this secure version is not able to receive events from the host OS.
Insecure “passthrough-to-host” – the eventfd object is created on the host, and all operations are delegated to the host. Since this implementation is insecure, it is disallowed by default. To use this implementation, it must be explicitly allowed via the sys.insecure__allow_eventfd manifest option.
Gramine supports polling on eventfd via poll(), ppoll(), select(), epoll_*() system calls, in both secure and insecure modes.
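The single-process usage pattern that the secure mode targets can be sketched as follows. The example calls eventfd() through ctypes so that it does not depend on `os.eventfd()` (available only in Python 3.10 and later); the helper names are illustrative.

```python
import ctypes
import os
import struct

_libc = ctypes.CDLL(None, use_errno=True)
_libc.eventfd.argtypes = [ctypes.c_uint, ctypes.c_int]
_libc.eventfd.restype = ctypes.c_int


def make_eventfd(initial=0):
    """Create an eventfd with the given initial counter value."""
    fd = _libc.eventfd(initial, 0)
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def signal_event(fd, count=1):
    """Add count to the eventfd counter (an 8-byte native-endian write)."""
    os.write(fd, struct.pack("=Q", count))


def consume_events(fd):
    """Read and reset the counter, returning the accumulated value."""
    return struct.unpack("=Q", os.read(fd, 8))[0]
```

Typical use is within one process, for example between threads: one thread calls `signal_event()` and another waits in poll() or epoll and then calls `consume_events()`.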
Related system calls
☑ eventfd(): see notes above
☑ eventfd2(): see notes above
☑ close()
☑ read()
☑ write()
☑ select()
☑ pselect6()
☑ poll()
☑ ppoll()
☑ epoll_create()
☑ epoll_create1()
☑ epoll_wait()
☑ epoll_pwait()
☑ epoll_ctl()
☒ epoll_pwait2(): very rarely used by applications
Semaphores
There are two semaphore APIs in Linux kernel:
System V semaphores (older API),
POSIX semaphores (newer API).
POSIX semaphores are technically not a Linux kernel API. Instead, they are implemented on top of the POSIX shared memory functionality of Linux by libc (i.e., via /dev/shm pseudo-filesystem).
Gramine currently has limited support for POSIX semaphores. Gramine does not implement System V semaphores.
Please note that in case of the SGX backend, implementation of POSIX semaphores is insecure, as semaphores are placed in shared memory which by design is allocated in untrusted non-enclave memory, and there is no way for Gramine to intercept memory accesses to shared memory regions (to provide some security guarantees).
Related system calls
☒ semget()
☒ semop()
☒ semtimedop()
☒ semctl()
Related pseudo-files
▣ /dev/shm: partially implemented, insecure by itself, see the note above
Message queues
There are two message-queue APIs in Linux kernel:
System V message queue (older API),
POSIX message queue (newer API).
Gramine does not currently implement either of these APIs. Gramine could implement them in the future, if need arises.
Related system calls
☒ msgget()
☒ msgctl()
☒ msgrcv()
☒ msgsnd()
☒ mq_open()
☒ mq_getsetattr()
☒ mq_notify()
☒ mq_timedreceive()
☒ mq_timedsend()
☒ mq_unlink()
IOCTLs
By default, Gramine implements only a minimal set of IOCTL request codes. See the list under “Related system calls”.
It is possible to specify arbitrary IOCTLs (with arbitrary request codes and corresponding IOCTL data structures), targeted for special use cases like communication with hardware accelerators (e.g. GPUs) or implementing socket-related IOCTLs. This is achieved via sys.ioctl_structs and sys.allowed_ioctls manifest options. Read the documentation to learn how to use this feature. There is also a corresponding whitepaper on communication with hardware accelerators. Note that arbitrary IOCTLs specified in the manifest are pass-through and thus potentially insecure by themselves in e.g. SGX environments!
Related system calls
▣ ioctl()
  ▣ TIOCGPGRP: dummy
  ☑ FIONBIO
  ☑ FIONCLEX
  ☑ FIOCLEX
  ☑ FIOASYNC
  ☑ FIONREAD
  ▣ other IOCTLs via sys.ioctl_structs and sys.allowed_ioctls manifest options
Date and time
Gramine partially implements getting date/time: gettimeofday(), time(), clock_gettime(), clock_getres() system calls.
Gramine does not distinguish between different clocks available for clock_gettime() and clock_getres(). All clocks are emulated via the CLOCK_REALTIME clock.
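One practical consequence is that code measuring intervals with CLOCK_MONOTONIC cannot assume the clock never steps backwards. A defensive sketch, with the illustrative helper name `elapsed_seconds`:

```python
import time


def elapsed_seconds(fn, clock=time.CLOCK_MONOTONIC):
    """Time a call, tolerating a clock that is backed by wall-clock time."""
    start = time.clock_gettime(clock)
    fn()
    end = time.clock_gettime(clock)
    # If the requested clock is emulated via CLOCK_REALTIME, an adjustment of
    # the wall clock could make end < start; clamp instead of going negative.
    return max(0.0, end - start)


if __name__ == "__main__":
    print(elapsed_seconds(lambda: sum(range(100000))))
```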
Gramine does not support setting or adjusting date/time: settimeofday(), clock_settime(), adjtimex(), clock_adjtime().
Gramine does not currently support getting process times (like user time, system time): times().
Note on trustworthiness of date/time on SGX
In case of SGX backend, date/time cannot be trusted because it is queried from the possibly malicious host OS. There is currently no solution to this limitation.
Related system calls
☑ gettimeofday()
☑ time()
▣ clock_gettime(): all clocks emulated via CLOCK_REALTIME
▣ clock_getres(): all clocks emulated via CLOCK_REALTIME
☒ settimeofday(): very rarely used by applications
☒ clock_settime(): very rarely used by applications
☒ adjtimex(): very rarely used by applications
☒ clock_adjtime(): very rarely used by applications
☒ times(): may be implemented in the future
Sleeps, timers and alarms
Gramine implements sleep system calls: nanosleep() and clock_nanosleep(). For the latter system call, all clocks are emulated via the CLOCK_REALTIME clock. TIMER_ABSTIME is supported. Both system calls correctly update the remaining time if they were interrupted by a signal handler.
Gramine implements getting and setting the interval timer: getitimer() and setitimer(). Only ITIMER_REAL is supported.
Gramine implements alarm clocks via alarm().
Gramine does not currently implement the POSIX per-process timer: timer_create(), etc. Gramine also does not currently implement timers that notify via file descriptors. Gramine could implement these timers in the future, if need arises.
Related system calls
☑
nanosleep()
▣
clock_nanosleep()
: all clocks emulated via CLOCK_REALTIME
▣
getitimer()
: only ITIMER_REAL
▣
setitimer()
: only ITIMER_REAL
☑
alarm()
☒
timer_create()
: may be implemented in the future
☒
timer_settime()
: may be implemented in the future
☒
timer_gettime()
: may be implemented in the future
☒
timer_getoverrun()
: may be implemented in the future
☒
timer_delete()
: may be implemented in the future
☒
timerfd_create()
: may be implemented in the future
☒
timerfd_settime()
: may be implemented in the future
☒
timerfd_gettime()
: may be implemented in the future
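The following sketch shows one way an application could rely on the supported ITIMER_REAL interval timer rather than on the unimplemented timer_create() or timerfd_create() families. It is an illustration only: `wait_for_itimer` is a hypothetical helper, and the code assumes it runs in the main thread (a requirement of Python's `signal.signal`).

```python
import signal
import time


def wait_for_itimer(delay=0.05, timeout=2.0):
    """Arm ITIMER_REAL once and report whether SIGALRM arrived before `timeout`."""
    fired = []
    previous = signal.signal(signal.SIGALRM, lambda signum, frame: fired.append(signum))
    try:
        # One-shot timer: the interval argument defaults to 0 (no re-arming).
        signal.setitimer(signal.ITIMER_REAL, delay)
        deadline = time.monotonic() + timeout
        while not fired and time.monotonic() < deadline:
            time.sleep(0.01)  # sleep in short steps until the handler has run
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        # Restore the earlier handler (None means it was not installed from Python).
        signal.signal(signal.SIGALRM, previous if previous is not None else signal.SIG_DFL)
    return bool(fired)


if __name__ == "__main__":
    print("SIGALRM delivered:", wait_for_itimer())
```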
Randomness
Gramine implements obtaining random bytes via two Linux APIs:
getrandom() system call,
/dev/random and /dev/urandom pseudo-files.
In case of SGX backend, Gramine always uses only one source of random bytes: the RDRAND x86 instruction. This is a secure source of randomness.
Related system calls
☑
getrandom()
Related pseudo-files
☑
/dev/random
☑
/dev/urandom
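Both APIs can be exercised from application code as in the sketch below. The helper names are hypothetical; `os.getrandom()` is Python's wrapper around the getrandom() system call.

```python
import os


def random_bytes_getrandom(n):
    """Obtain n random bytes via the getrandom() system call."""
    return os.getrandom(n)


def random_bytes_urandom_file(n):
    """Obtain n random bytes by reading the /dev/urandom pseudo-file."""
    with open("/dev/urandom", "rb", buffering=0) as f:
        data = b""
        while len(data) < n:  # unbuffered reads may be short; loop until done
            data += f.read(n - len(data))
        return data


if __name__ == "__main__":
    print(random_bytes_getrandom(16).hex())
    print(random_bytes_urandom_file(16).hex())
```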
System information and resource accounting
Gramine does not support getting resource usage metrics via the getrusage() system call.
Gramine reports only a minimal set of system information via the sysinfo() system call: only the totalram, totalhigh, freeram and freehigh fields are populated.
Gramine reports only a minimal set of kernel information via the uname() system call: only the sysname, nodename, release, version, machine and domainname fields are populated. Of these, only nodename is populated with the host-provided name. The remaining fields are hard-coded (e.g. release is currently hard-coded to 3.10.0).
Gramine has dummy support for setting hostname and domain name via sethostname() and setdomainname(). The set names are not propagated to the host OS or other Gramine processes.
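A small sketch of reading the uname() fields from an application follows; `describe_kernel` is a hypothetical helper. Since release is hard-coded as described above, code that branches on the reported kernel release would see that fixed value rather than the host's. Note that Python's `os.uname()` does not expose the domainname field.

```python
import os


def describe_kernel():
    """Return the uname() fields that Python exposes, as a plain dict."""
    info = os.uname()
    return {
        "sysname": info.sysname,
        "nodename": info.nodename,  # per the text above, the only host-provided field
        "release": info.release,
        "version": info.version,
        "machine": info.machine,
    }


if __name__ == "__main__":
    for field, value in describe_kernel().items():
        print(f"{field}: {value}")
```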
Gramine has minimal and mostly dummy support for getting and setting resource limits, via getrlimit(), setrlimit(), prlimit64(). The prlimit64() syscall can be issued only on the current process. The following resources are supported:
RLIMIT_CPU – dummy, no limit by default
RLIMIT_FSIZE – dummy, no limit by default
RLIMIT_DATA – implemented, affects brk() system call
RLIMIT_STACK – dummy, equal to sys.stack.size manifest option by default
RLIMIT_CORE – dummy, zero by default
RLIMIT_RSS – dummy, no limit by default
RLIMIT_NPROC – dummy, no limit by default
RLIMIT_NOFILE – implemented, equal to sys.fds.limit manifest option by default
RLIMIT_MEMLOCK – dummy, no limit by default
RLIMIT_AS – dummy, no limit by default
RLIMIT_LOCKS – dummy, no limit by default
RLIMIT_SIGPENDING – dummy, no limit by default
RLIMIT_MSGQUEUE – dummy, ~800K by default
RLIMIT_NICE – dummy, zero by default
RLIMIT_RTPRIO – dummy, zero by default
RLIMIT_RTTIME – dummy, no limit by default
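The sketch below queries and adjusts RLIMIT_NOFILE, one of the two limits listed above as actually implemented. The helper names are hypothetical. `resource.prlimit()` is called with PID 0, which refers to the current process, matching the restriction described above.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE pair via getrlimit()."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def nofile_limits_via_prlimit():
    """Same query through prlimit(); PID 0 means the current process."""
    return resource.prlimit(0, resource.RLIMIT_NOFILE)


def set_soft_nofile(new_soft):
    """Set the soft RLIMIT_NOFILE, clamped to the hard limit; return the new pair."""
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("RLIMIT_NOFILE (soft, hard):", nofile_limits())
```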
Gramine supports the /proc/cpuinfo, /proc/meminfo, /proc/stat pseudo-files with system information. In addition, Gramine supports CPU- and NUMA-node-specific pseudo-files under /sys/devices/system/cpu/ and /sys/devices/system/node/. See the list under “Related pseudo-files”. For additional pseudo-files containing process-specific information, see the “Process and thread identifiers” section.
Related system calls
☒
getrusage()
▣
sysinfo()
: only totalram, totalhigh, freeram and freehigh
▣
uname()
: only sysname, nodename, release, version, machine and domainname
▣
sethostname()
: dummy
▣
setdomainname()
: dummy
▣
getrlimit()
: see notes above
▣
setrlimit()
: see notes above
▣
prlimit64()
: see notes above
Related pseudo-files
▣
/proc/cpuinfo
: partially implemented
☑
processor, vendor_id, cpu family, model, model name, stepping, physical id, core id, cpu cores, bogomips, siblings
☑
flags
: all known CPU flags
▣
/proc/meminfo
: partially implemented
☑
MemTotal, MemFree, MemAvailable, Committed_AS, VmallocTotal
☒ remaining fields: always zero
▣
/proc/stat
: dummy
▣
cpu
line: all fields are zeros
▣
cpuX
lines: all fields are zeros
▣
ctxt
line: always zero
▣
btime
line: always zero
▣
processes
line: always one
▣
procs_running
line: always one
▣
procs_blocked
line: always zero
☒
intr
line
☒
softirq
line
▣
/sys/devices/system/cpu/
: only most important files implemented
▣
/sys/devices/system/cpu/cpu[x]/
▣
/sys/devices/system/cpu/cpu[x]/cache/index[x]/
☑
/sys/devices/system/cpu/cpu[x]/cache/index[x]/coherency_line_size
☑
/sys/devices/system/cpu/cpu[x]/cache/index[x]/level
☑
/sys/devices/system/cpu/cpu[x]/cache/index[x]/number_of_sets
☑
/sys/devices/system/cpu/cpu[x]/cache/index[x]/physical_line_partition
☑
/sys/devices/system/cpu/cpu[x]/cache/index[x]/shared_cpu_map
☑
/sys/devices/system/cpu/cpu[x]/cache/index[x]/size
☑
/sys/devices/system/cpu/cpu[x]/cache/index[x]/type
☑
/sys/devices/system/cpu/cpu[x]/online
▣
/sys/devices/system/cpu/cpu[x]/topology/
☑
/sys/devices/system/cpu/cpu[x]/topology/core_id
☑
/sys/devices/system/cpu/cpu[x]/topology/core_siblings
☑
/sys/devices/system/cpu/cpu[x]/topology/physical_package_id
☑
/sys/devices/system/cpu/cpu[x]/topology/thread_siblings
☑
/sys/devices/system/cpu/kernel_max
☑
/sys/devices/system/cpu/offline
☑
/sys/devices/system/cpu/online
☑
/sys/devices/system/cpu/possible
☑
/sys/devices/system/cpu/present
▣
/sys/devices/system/node/
: only most important files implemented
▣
/sys/devices/system/node/node[x]/
☑
/sys/devices/system/node/node[x]/cpumap
☑
/sys/devices/system/node/node[x]/distance
☑
/sys/devices/system/node/node[x]/hugepages/
▣
/sys/devices/system/node/node[x]/hugepages/hugepages-[y]/nr_hugepages
: always zero
▣
/sys/devices/system/node/node[x]/meminfo
: partially implemented
☑
MemTotal, MemFree, MemUsed
☒ remaining fields: always zero
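Several of the pseudo-files listed above (for example /sys/devices/system/cpu/online) use the Linux CPU-list text format such as `0-3,8`. The following sketch parses that format; `parse_cpu_list` and `online_cpus` are hypothetical helper names chosen for this illustration.

```python
def parse_cpu_list(text):
    """Expand a Linux CPU-list string such as "0-3,8" into a list of CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file (e.g. no offline CPUs) yields no entries
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read and parse the list of online CPUs from the sysfs pseudo-file."""
    with open(path) as f:
        return parse_cpu_list(f.read())


if __name__ == "__main__":
    print("online CPUs:", online_cpus())
```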
Misc
Gramine implements vDSO, with four functions: __vdso_clock_gettime(), __vdso_gettimeofday(), __vdso_time(), __vdso_getcpu(). These functions invoke the corresponding system calls; see the “Date and time” section and the “Scheduling” section.
Gramine implements operations on file descriptors (FDs):
duplicating FDs via dup(), dup2(), dup3(), fcntl(F_DUPFD), fcntl(F_DUPFD_CLOEXEC),
getting/setting FD flags via fcntl(F_GETFD) and fcntl(F_SETFD); the only flag is FD_CLOEXEC.
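The FD operations above can be sketched as follows. `dup_with_cloexec` is a hypothetical helper. It uses fcntl(F_DUPFD) directly rather than Python's `os.dup()`, because `os.dup()` already marks the new descriptor as non-inheritable (that is, it sets FD_CLOEXEC).

```python
import fcntl
import os


def dup_with_cloexec(fd):
    """Duplicate fd with F_DUPFD, then set FD_CLOEXEC on the copy.

    Returns (new_fd, flags_before, flags_after).
    """
    new_fd = fcntl.fcntl(fd, fcntl.F_DUPFD, 0)  # lowest free FD >= 0, FD_CLOEXEC cleared
    before = fcntl.fcntl(new_fd, fcntl.F_GETFD)
    fcntl.fcntl(new_fd, fcntl.F_SETFD, before | fcntl.FD_CLOEXEC)
    after = fcntl.fcntl(new_fd, fcntl.F_GETFD)
    return new_fd, before, after


if __name__ == "__main__":
    r, w = os.pipe()
    new_fd, before, after = dup_with_cloexec(r)
    print(f"fd {r} duplicated as {new_fd}: flags {before} -> {after}")
    for fd in (r, w, new_fd):
        os.close(fd)
```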
Gramine implements several arch-specific (x86-64) operations:
getting/setting the FS segment register via arch_prctl(ARCH_GET_FS) and arch_prctl(ARCH_SET_FS),
getting/setting the Intel AMX feature via arch_prctl(ARCH_GET_XCOMP_SUPP), arch_prctl(ARCH_GET_XCOMP_PERM) and arch_prctl(ARCH_REQ_XCOMP_PERM).
Gramine implements minimal session management via setsid() and getsid(). It is possible to make the calling process the leader of a new session, which is enough for many workloads (e.g. JVM). However, there are serious limitations:
in getsid(), it’s not possible to get the session ID of other processes (only of this process),
it’s impossible to send signals to a process group,
daemonization is still broken: the orphaned child is not adopted by init, because there is no init process in Gramine.
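The supported subset of session management can be illustrated with the sketch below, where a forked child becomes a session leader and queries only its own session ID via getsid(0). The helper name is hypothetical, and the example assumes that fork() is available to the application.

```python
import os


def child_becomes_session_leader():
    """Fork a child that calls setsid(); return True if it became session leader."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(r)
            os.setsid()  # the child is not a process group leader, so this is allowed
            # getsid(0) queries the calling process only, which is the supported case.
            ok = os.getsid(0) == os.getpid()
            os.write(w, b"1" if ok else b"0")
        finally:
            os._exit(0)
    os.close(w)
    result = os.read(r, 1)  # empty if the child failed before writing
    os.close(r)
    os.waitpid(pid, 0)
    return result == b"1"


if __name__ == "__main__":
    print("child is session leader:", child_becomes_session_leader())
```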
Gramine implements the /dev/null and /dev/zero pseudo-files.
Related system calls
☑
gettimeofday()
: implemented in vDSO
▣
clock_gettime()
: implemented in vDSO
☑
time()
: implemented in vDSO
▣
getcpu()
: implemented in vDSO
☑
dup()
☑
dup2()
☑
dup3()
▣
fcntl()
☑
F_DUPFD
☑
F_DUPFD_CLOEXEC
☑
F_GETFD
☑
F_SETFD
▣
arch_prctl()
☑
ARCH_GET_XCOMP_SUPP
☑
ARCH_GET_XCOMP_PERM
☑
ARCH_REQ_XCOMP_PERM
▣
setsid()
▣
getsid()
Related pseudo-files
☑
/dev/
☑
/dev/null
☑
/dev/zero
Advanced/infeasible, unimplemented features
Gramine does not implement the following classes of features. This is by design, to keep the codebase of Gramine minimal.
Berkeley Packet Filters (BPF) and eBPF: bpf()
Capabilities: capget(), capset()
Execution control and debugging: ptrace(), syslog(), perf_event_open(), acct()
In-kernel key management (keyrings): add_key(), request_key(), keyctl()
Kernel modules: create_module(), init_module(), finit_module(), delete_module(), query_module(), get_kernel_syms()
Memory Protection Keys: pkey_alloc(), pkey_mprotect(), pkey_free()
Namespaces: setns(), unshare()
Paging and swapping: swapon(), swapoff(), readahead()
Process execution domain: personality()
Secure Computing (seccomp) state: seccomp()
Zero-copy transfer of data: splice(), tee(), vmsplice(), copy_file_range()
Transfer of data between processes: process_vm_readv(), process_vm_writev()
Filesystem configuration context: fsopen(), fsconfig(), fspick(), fsmount()
Landlock: landlock_create_ruleset(), landlock_add_rule(), landlock_restrict_self()
Misc: vhangup(), modify_ldt(), kexec_load(), kexec_file_load(), reboot(), iopl(), ioperm(), uselib(), _sysctl(), quotactl(), quotactl_fd(), nfsservctl(), getpmsg(), putpmsg(), afs_syscall(), tuxcall(), security(), lookup_dcookie(), restart_syscall(), vserver(), io_pgetevents(), rseq(), open_tree()
Related system calls
☒
_sysctl()
☒
acct()
☒
add_key()
☒
afs_syscall()
☒
bpf()
☒
capget()
☒
capset()
☒
close_range()
☒
copy_file_range()
☒
create_module()
☒
delete_module()
☒
finit_module()
☒
fsconfig()
☒
fsmount()
☒
fsopen()
☒
fspick()
☒
get_kernel_syms()
☒
getpmsg()
☒
init_module()
☒
io_pgetevents()
☒
ioperm()
☒
iopl()
☒
kexec_file_load()
☒
kexec_load()
☒
keyctl()
☒
landlock_add_rule()
☒
landlock_create_ruleset()
☒
landlock_restrict_self()
☒
lookup_dcookie()
☒
modify_ldt()
☒
nfsservctl()
☒
open_tree()
☒
perf_event_open()
☒
personality()
☒
pkey_alloc()
☒
pkey_free()
☒
pkey_mprotect()
☒
process_vm_readv()
☒
process_vm_writev()
☒
ptrace()
☒
putpmsg()
☒
query_module()
☒
quotactl()
☒
quotactl_fd()
☒
readahead()
☒
reboot()
☒
request_key()
☒
restart_syscall()
☒
rseq()
☒
seccomp()
☒
security()
☒
setns()
☒
splice()
☒
swapoff()
☒
swapon()
☒
syslog()
☒
tee()
☒
tuxcall()
☒
unshare()
☒
uselib()
☒
vhangup()
☒
vmsplice()
☒
vserver()
Gramine-specific features
Attestation
Gramine exposes low-level abstractions of attestation report and attestation quote objects (SGX Report and SGX Quote respectively, in case of SGX backend) through the /dev/attestation/ pseudo-filesystem. Reading and writing the /dev/attestation/ pseudo-files allows applications to implement local attestation and remote attestation flows. Additionally, the /dev/attestation/keys/ pseudo-dir exposes pseudo-files to set encryption keys (in particular, for encrypted files).
For detailed information, refer to the “Attestation and Secret Provisioning” documentation of Gramine.
Related pseudo-files
☑
/dev/attestation/
☑
/dev/attestation/attestation_type
☑
/dev/attestation/user_report_data
☑
/dev/attestation/target_info
☑
/dev/attestation/my_target_info
☑
/dev/attestation/report
☑
/dev/attestation/quote
☑
/dev/attestation/keys
☑
/dev/attestation/keys/<key_name>
☑
/dev/attestation/keys/_sgx_mrenclave
(only for SGX)
☑
/dev/attestation/keys/_sgx_mrsigner
(only for SGX)
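A rough sketch of a remote attestation flow over these pseudo-files is shown below. It assumes the sequence described in the “Attestation and Secret Provisioning” documentation (write the report data to user_report_data, then read quote) and a 64-byte report data field as in the SGX Report. The helper names are hypothetical, and `get_quote` returns None when the pseudo-files are absent, for example outside of Gramine.

```python
import os

ATTESTATION_DIR = "/dev/attestation"
REPORT_DATA_SIZE = 64  # size of the SGX report data field, in bytes


def pad_report_data(data):
    """Zero-pad user data to the fixed report data size."""
    if len(data) > REPORT_DATA_SIZE:
        raise ValueError("report data must be at most 64 bytes")
    return data.ljust(REPORT_DATA_SIZE, b"\0")


def get_quote(user_data):
    """Return the quote bytes, or None if the pseudo-files are unavailable."""
    quote_path = os.path.join(ATTESTATION_DIR, "quote")
    if not os.path.exists(quote_path):
        return None  # not running under Gramine with attestation enabled
    # Assumed flow: first bind the user data to the report, then read the quote.
    with open(os.path.join(ATTESTATION_DIR, "user_report_data"), "wb") as f:
        f.write(pad_report_data(user_data))
    with open(quote_path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    quote = get_quote(b"example-nonce")
    print("no attestation pseudo-files" if quote is None else f"quote: {len(quote)} bytes")
```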
Notes on System V ABI
⚠ The description below assumes the x86-64 architecture.
Gramine implements the system-call entry point (analogous to the SYSCALL x86 instruction ABI). Instead of performing a context switch from userland (ring-3) to kernelspace (ring-0), Gramine relies on the system call being routed directly into the Gramine process. There are two paths by which the application’s system call requests end up in Gramine emulation:
Fast path, through a patched C standard library (e.g. Glibc or musl): Gramine ships patched Glibc and musl where raw SYSCALL instructions are replaced with function calls into Gramine’s syscall entry point.
Slow path, through an exception-handling mechanism:
In case of Linux backend, Gramine sets up a seccomp policy that redirects all syscall requests from the Linux kernel back into the Gramine process.
In case of SGX backend, Intel SGX hardware itself forbids the SYSCALL instruction and instead generates a #UD (illegal instruction) exception, which is delivered into the Gramine process.
The fast path is recommended for all applications. However, some applications bypass Glibc/musl and issue raw SYSCALL instructions (e.g., Golang statically compiled binaries); in this case the slow path is activated.
Gramine’s syscall entry point implementation first saves the CPU context of the current application thread on the internal stack, then calls the syscall-emulation function. When that function returns, it calls the context-restoring function, which passes control back to the application thread. The context consists of GPRs, FP control word (fpcw) and the SSE/AVX/… control word (mxcsr).
Note that Gramine may clobber all FP/SSE/AVX/… (extended) state except the control words. We rely on the fact that applications do not assume that this extended state is preserved across system calls. Indeed, the extended state (bar control words) is explicitly described as not preserved by the System V ABI, and we assume that no sane application issues syscalls in a non-System-V compliant manner. See System V ABI docs, “Register Usage” for more information.
Gramine supports Linux x86-64 signal frames.
Notes on application loading
Gramine can execute only ELF binaries (executables and libraries) and executable scripts. Other formats are not supported.
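As an illustration of the two supported formats, the sketch below classifies a file by its leading bytes: ELF files start with the magic bytes `\x7fELF`, and executable scripts start with a `#!` shebang line. This is only a pre-check an application or deployment script might perform; it is not how Gramine's loader is implemented, and both helper names are hypothetical.

```python
def classify_header(head):
    """Classify a file from its first bytes: "elf", "script" or "unsupported"."""
    if head[:4] == b"\x7fELF":
        return "elf"
    if head[:2] == b"#!":
        return "script"
    return "unsupported"


def classify_executable(path):
    """Read the first bytes of the file at `path` and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(4))


if __name__ == "__main__":
    import sys

    print(sys.executable, "->", classify_executable(sys.executable))
```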