Caution

This is documentation for the development version of the project, aka master branch. If you installed Gramine from packages, documentation for the stable version is available at https://gramine.readthedocs.io/en/stable/.

Gramine features

⚠ This is a highly technical document intended for software engineers with knowledge of OS kernels.

⛏ This is a living document. The last major update happened in November 2023 and closely corresponds to Gramine v1.6.

Gramine strives to run native, unmodified Linux applications on any platform. The SGX backend additionally strives to provide security guarantees, in particular, to protect against a malicious host OS.

Gramine intercepts all application requests to the host OS. Some of these requests are processed entirely inside Gramine, and some are funneled through a thin API to the host OS. Either way, each application’s request and each host’s reply are verified for correctness and consistency. For these verifications, Gramine maintains internal, “shadow” state. Thus, Gramine defends against Iago attacks.
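The verification idea can be illustrated with a small sketch. This is not Gramine code: `checked_read` and `untrusted_host_read` are hypothetical names, and the check shown (a host reply must never contain more bytes than were requested) is only one example of the kind of consistency check described above.

```python
class HostReplyError(Exception):
    """Raised when an untrusted host reply fails a consistency check."""


def checked_read(untrusted_host_read, fd, count):
    """Call a hypothetical untrusted read primitive and validate its reply.

    `untrusted_host_read(fd, count)` stands in for a request that is
    forwarded to the host OS; its return value is not trusted.
    """
    data = untrusted_host_read(fd, count)
    if not isinstance(data, (bytes, bytearray)):
        raise HostReplyError("host returned a reply of the wrong type")
    if len(data) > count:
        # A malicious host claims more data than was asked for; accepting
        # it could overflow the caller's buffer.
        raise HostReplyError("host returned more bytes than requested")
    return bytes(data)
```

A well-behaved host passes the check unchanged, while an oversized reply is rejected before it reaches the application.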

Gramine strives to be 100% compatible with the Linux kernel, even when it deviates from standards like POSIX (“bug-for-bug compatibility”). At the same time, Gramine is minimalistic, and implements only the most important subset of Linux functionality, enough to run portable, hardware-independent applications.

Gramine currently has two backends: execution on the host Linux OS (called gramine-direct) and execution inside an Intel SGX enclave (called gramine-sgx). If some feature has quirks and peculiarities in some backend, we describe it explicitly. More backends are possible in the future.

Features implemented in Gramine can be classified as:

  • Linux features: features can be (1) implemented, (2) partially implemented, or (3) not implemented at all in Gramine. If the feature is partially implemented, then we also document the parts that are implemented and the parts that are not implemented. If the feature is not implemented at all, we also specify whether there are plans to implement it in the future (and if not, the rationale why not).

    • Some features are not implemented by design: either they increase the Trusted Computing Base (TCB) of Gramine disproportionately, or they cannot be implemented securely.

    • Other features are not implemented because they are unused: some Linux features are deprecated or ill-conceived, and applications do not use them (or have fallbacks when these features are not detected).

  • Gramine-specific features: additional features, e.g., attestation primitives. Note that this document covers only APIs exposed to applications (like additional system calls and pseudo-files) and doesn’t cover Gramine features transparent to the app (exitless, ASLR, debugging, etc.).

Each feature has a list of related system calls and pseudo-files, for cross-reference.

Terminology

Similarly to Linux, Gramine provides two interfaces to user applications:

  • Linux userspace-to-kernel interface, consisting of two sub-interfaces:

    • Linux System Call Interface: a set of system calls which allow applications to access system resources and services. Examples: open(), fork(), gettimeofday().

    • Pseudo filesystems: a set of special directories with file contents containing information about the Gramine instance, system resources, hardware configuration, etc. These filesystems are generated on the fly upon Gramine startup. Examples: /proc/cpuinfo, /dev/attestation/quote.

  • Linux kernel-to-userspace interface, in particular, two standards:

    • System V ABI: defines how applications invoke system calls and receive signals.

    • Executable and Linking Format (ELF): defines how applications are loaded from binary files.


Legend:

  • ☑ implemented (no serious limitations)

  • ▣ partially implemented (serious limitations or quirks)

  • ☒ not implemented

List of system calls

Gramine implements ~170 system calls out of ~360 system calls available on Linux. Many system calls are implemented only partially, typically because real-world workloads do not use the unimplemented functionality (for example, the O_ASYNC flag in open() is not widely used). Some system calls are not implemented because they are deprecated in Linux, because they are unused by real-world applications, or because they don’t fit the purpose of Gramine (“virtualize a single application”).

The list of implemented system calls grows with time, as Gramine adds functionality required by real-world workloads.

The list below is generated from the syscall table of Linux 6.0.

Status of system call support in Gramine
  • read() 9a 10 11a 11b 14

  • write() 9a 10 11a 11b 14

  • open() 9a

  • close() 9a 10 11a 11b 14

  • stat() 9a

  • fstat() 9a 10 11a 11b

  • lstat() 9a

  • poll() 9a 10 11a 11b 12 14

  • lseek() 9a

  • mmap() 6 9a

  • mprotect() 6

  • munmap() 6

  • brk() 6 22

  • rt_sigaction() 7

  • rt_sigprocmask() 7

  • rt_sigreturn() 7

  • ioctl() 10 11a 11b 18

  • pread64() 9a

  • pwrite64() 9a

  • readv() 9a 10 11a 11b

  • writev() 9a 10 11a 11b

  • access() 9a

  • pipe() 10

  • select() 9a 10 11a 11b 12 14

  • sched_yield() 4

  • mremap() 6

  • msync() 6 9a

  • mincore() 6

  • madvise() 6

  • shmget() 17

  • shmat() 17

  • shmctl() 17

  • dup() 23

  • dup2() 23

  • pause() 7

  • nanosleep() 20

  • getitimer() 20

  • alarm() 20

  • setitimer() 20

  • getpid() 3

  • sendfile() 9a 10 11a 11b

  • socket() 11a 11b

  • connect() 11a 11b

  • accept() 11a 11b

  • sendto() 11a 11b

  • recvfrom() 11a 11b

  • sendmsg() 11a 11b

  • recvmsg() 11a 11b

  • shutdown() 11a 11b

  • bind() 11a 11b

  • listen() 11a 11b

  • getsockname() 11a 11b

  • getpeername() 11a 11b

  • socketpair() 11b

  • setsockopt() 11a 11b

  • getsockopt() 11a 11b

  • clone() 1 2

  • fork() 1

  • vfork() 1

  • execve() 1

  • exit() 1 2

  • wait4() 7

  • kill() 7

  • uname() 22

  • semget() 15

  • semop() 15

  • semctl() 15

  • shmdt() 17

  • msgget() 16

  • msgsnd() 16

  • msgrcv() 16

  • msgctl() 16

  • fcntl() 9b 10 11a 11b 23

  • flock() 9b

  • fsync() 9a

  • fdatasync() 9a

  • truncate() 9a

  • ftruncate() 9a

  • getdents() 9a

  • getcwd() 9a

  • chdir() 9a

  • fchdir() 9a

  • rename() 9a

  • mkdir() 9a

  • rmdir() 9a

  • creat() 9a

  • link() 9d

  • unlink() 9a

  • symlink() 9d

  • readlink() 9d

  • chmod() 9a

  • fchmod() 9a

  • chown() 9a

  • fchown() 9a

  • lchown() 9d

  • umask() 9a

  • gettimeofday() 19 23

  • getrlimit() 22

  • getrusage() 22

  • sysinfo() 22

  • times() 19

  • ptrace() 24

  • getuid() 8

  • syslog() 24

  • getgid() 8

  • setuid() 8

  • setgid() 8

  • geteuid() 8

  • getegid() 8

  • setpgid() 8

  • getppid() 3

  • getpgrp() 3

  • setsid() 23

  • setreuid() 8

  • setregid() 8

  • getgroups() 8

  • setgroups() 8

  • setresuid() 8

  • getresuid() 8

  • setresgid() 8

  • getresgid() 8

  • getpgid() 3

  • setfsuid() 8

  • setfsgid() 8

  • getsid() 23

  • capget() 24

  • capset() 24

  • rt_sigpending() 7

  • rt_sigtimedwait() 7

  • rt_sigqueueinfo() 7

  • rt_sigsuspend() 7

  • sigaltstack() 7

  • utime() 9a

  • mknod() 10

  • uselib() 24

  • personality() 24

  • ustat() 9a

  • statfs() 9a

  • fstatfs() 9a

  • sysfs() 9a

  • getpriority() 4

  • setpriority() 4

  • sched_setparam() 4

  • sched_getparam() 4

  • sched_setscheduler() 4

  • sched_getscheduler() 4

  • sched_get_priority_max() 4

  • sched_get_priority_min() 4

  • sched_rr_get_interval() 4

  • mlock() 6

  • munlock() 6

  • mlockall() 6

  • munlockall() 6

  • vhangup() 24

  • modify_ldt() 24

  • pivot_root() 9a

  • _sysctl() 24

  • prctl() 2

  • arch_prctl() 2

  • adjtimex() 19

  • setrlimit() 22

  • chroot() 9a

  • sync() 9a

  • acct() 24

  • settimeofday() 19

  • mount() 9a

  • umount2() 9a

  • swapon() 24

  • swapoff() 24

  • reboot() 24

  • sethostname() 22

  • setdomainname() 22

  • iopl() 24

  • ioperm() 24

  • create_module() 24

  • init_module() 24

  • delete_module() 24

  • get_kernel_syms() 24

  • query_module() 24

  • quotactl() 24

  • nfsservctl() 24

  • getpmsg() 24

  • putpmsg() 24

  • afs_syscall() 24

  • tuxcall() 24

  • security() 24

  • gettid() 3

  • readahead() 24

  • setxattr() 9a

  • lsetxattr() 9a

  • fsetxattr() 9a

  • getxattr() 9a

  • lgetxattr() 9a

  • fgetxattr() 9a

  • listxattr() 9a

  • llistxattr() 9a

  • flistxattr() 9a

  • removexattr() 9a

  • lremovexattr() 9a

  • fremovexattr() 9a

  • tkill() 7

  • time() 19

  • futex() 5

  • sched_setaffinity() 4

  • sched_getaffinity() 4

  • set_thread_area() 2

  • io_setup() 13

  • io_destroy() 13

  • io_getevents() 13

  • io_submit() 13

  • io_cancel() 13

  • get_thread_area() 2

  • lookup_dcookie() 24

  • epoll_create() 9a 10 11a 11b 12 14

  • remap_file_pages() 6

  • getdents64() 9a

  • set_tid_address() 3

  • restart_syscall() 24

  • semtimedop() 15

  • fadvise64() 9a

  • timer_create() 20

  • timer_settime() 20

  • timer_gettime() 20

  • timer_getoverrun() 20

  • timer_delete() 20

  • clock_settime() 19

  • clock_gettime() 19

  • clock_getres() 19

  • clock_nanosleep() 20

  • exit_group() 1

  • epoll_wait() 9a 10 11a 11b 12 14

  • epoll_ctl() 9a 10 11a 11b 12 14

  • tgkill() 7

  • utimes() 9a

  • vserver() 24

  • mbind() 6

  • set_mempolicy() 6

  • get_mempolicy() 6

  • mq_open() 16

  • mq_unlink() 16

  • mq_timedsend() 16

  • mq_timedreceive() 16

  • mq_notify() 16

  • mq_getsetattr() 16

  • kexec_load() 24

  • waitid() 7

  • add_key() 24

  • request_key() 24

  • keyctl() 24

  • ioprio_set() 4

  • ioprio_get() 4

  • inotify_init() 9c

  • inotify_add_watch() 9c

  • inotify_rm_watch() 9c

  • migrate_pages() 6

  • openat() 9a

  • mkdirat() 9a

  • mknodat() 10

  • fchownat() 9a

  • futimesat() 9a

  • newfstatat() 9a

  • unlinkat() 9a

  • renameat() 9a

  • linkat() 9d

  • symlinkat() 9d

  • readlinkat() 9d

  • fchmodat() 9a

  • faccessat() 9a

  • pselect6() 9a 10 11a 11b 12 14

  • ppoll() 9a 10 11a 11b 12 14

  • unshare() 1 24

  • set_robust_list() 5

  • get_robust_list() 5

  • splice() 24

  • tee() 24

  • sync_file_range() 9a

  • vmsplice() 24

  • move_pages() 6

  • utimensat() 9a

  • epoll_pwait() 9a 10 11a 11b 12 14

  • signalfd() 7

  • timerfd_create() 20

  • eventfd() 14

  • fallocate() 9a

  • timerfd_settime() 20

  • timerfd_gettime() 20

  • accept4() 11a 11b

  • signalfd4() 7

  • eventfd2() 14

  • epoll_create1() 9a 10 11a 11b 12 14

  • dup3() 23

  • pipe2() 10

  • inotify_init1() 9c

  • preadv() 9a

  • pwritev() 9a

  • rt_tgsigqueueinfo() 7

  • perf_event_open() 24

  • recvmmsg() 11a 11b

  • fanotify_init() 9c

  • fanotify_mark() 9c

  • prlimit64() 22

  • name_to_handle_at() 9a

  • open_by_handle_at() 9a

  • clock_adjtime() 19

  • syncfs() 9a

  • sendmmsg() 11a 11b

  • setns() 24

  • getcpu() 4 23

  • process_vm_readv() 24

  • process_vm_writev() 24

  • kcmp() 1

  • finit_module() 24

  • sched_setattr() 4

  • sched_getattr() 4

  • renameat2() 9a

  • seccomp() 24

  • getrandom() 21

  • memfd_create() 6

  • kexec_file_load() 24

  • bpf() 24

  • execveat() 1

  • userfaultfd() 7

  • membarrier() 6

  • mlock2() 6

  • copy_file_range() 24

  • preadv2() 9a

  • pwritev2() 9a

  • pkey_mprotect() 24

  • pkey_alloc() 24

  • pkey_free() 24

  • statx() 9a

  • io_pgetevents() 24

  • rseq() 24

  • pidfd_send_signal() 7

  • io_uring_setup() 13

  • io_uring_enter() 13

  • io_uring_register() 13

  • open_tree() 24

  • move_mount() 9a

  • fsopen() 24

  • fsconfig() 24

  • fsmount() 24

  • fspick() 24

  • pidfd_open() 7

  • clone3() 1 2

  • close_range() 9a

  • openat2() 9a

  • pidfd_getfd() 7

  • faccessat2() 9a

  • process_madvise() 6 7

  • epoll_pwait2() 9a 10 11a 11b 12 14

  • mount_setattr() 9a

  • quotactl_fd() 24

  • landlock_create_ruleset() 24

  • landlock_add_rule() 24

  • landlock_restrict_self() 24

  • memfd_secret() 6

  • process_mrelease() 6 7

  • futex_waitv() 5

  • set_mempolicy_home_node() 6


List of pseudo-files

Gramine partially emulates Linux pseudo-filesystems: /dev, /proc and /sys.

Only a subset of the most widely used pseudo-files is implemented. The list of implemented pseudo-files grows with time, as Gramine adds functionality required by real-world workloads.

List of all pseudo-files in Gramine
  • /dev/ 9d 15 17 25

    • /dev/attestation/ 25

      • /dev/attestation/attestation_type 25

      • /dev/attestation/user_report_data 25

      • /dev/attestation/target_info 25

      • /dev/attestation/my_target_info 25

      • /dev/attestation/report 25

      • /dev/attestation/keys 25

        • /dev/attestation/keys/<key_name> 25

        • /dev/attestation/keys/_sgx_mrenclave 25

        • /dev/attestation/keys/_sgx_mrsigner 25

    • /dev/null 23

    • /dev/zero 23

    • /dev/random 21

    • /dev/urandom 21

    • /dev/shm 15 17

    • /dev/stdin 9d

    • /dev/stdout 9d

    • /dev/stderr 9d

  • /proc/ 3 8 9d 22 11a 11b

    • /proc/[this-pid]/ (aka /proc/self/) 3 9d

      • /proc/[this-pid]/cmdline 3

      • /proc/[this-pid]/cwd 3 9d

      • /proc/[this-pid]/exe 3 9d

      • /proc/[this-pid]/fd 3

      • /proc/[this-pid]/maps 3

      • /proc/[this-pid]/root 3 9d

      • /proc/[this-pid]/stat 3

      • /proc/[this-pid]/statm 3

      • /proc/[this-pid]/status 3 8

      • /proc/[this-pid]/task 3

    • /proc/[remote-pid]/ 3 9d

      • /proc/[remote-pid]/cwd 3 9d

      • /proc/[remote-pid]/exe 3 9d

      • /proc/[remote-pid]/root 3 9d

    • /proc/[local-tid]/ 3

    • /proc/[remote-tid]/ 3

    • /proc/cpuinfo 22

    • /proc/meminfo 22

    • /proc/stat 22

    • /proc/sys/ 3 11a 11b

      • /proc/sys/kernel/ 3

        • /proc/sys/kernel/pid_max 3

      • /proc/sys/net/ 11a 11b

        • /proc/sys/net/core/ 11a

        • /proc/sys/net/ipv4/ 11a

        • /proc/sys/net/ipv6/ 11a

        • /proc/sys/net/unix/ 11b

  • /sys/devices/system/ 22

    • /sys/devices/system/cpu/ 22

      • /sys/devices/system/cpu/cpu[x]/ 22

        • /sys/devices/system/cpu/cpu[x]/cache/index[x]/ 22

          • /sys/devices/system/cpu/cpu[x]/cache/index[x]/coherency_line_size 22

          • /sys/devices/system/cpu/cpu[x]/cache/index[x]/level 22

          • /sys/devices/system/cpu/cpu[x]/cache/index[x]/number_of_sets 22

          • /sys/devices/system/cpu/cpu[x]/cache/index[x]/physical_line_partition 22

          • /sys/devices/system/cpu/cpu[x]/cache/index[x]/shared_cpu_map 22

          • /sys/devices/system/cpu/cpu[x]/cache/index[x]/size 22

          • /sys/devices/system/cpu/cpu[x]/cache/index[x]/type 22

        • /sys/devices/system/cpu/cpu[x]/online 22

        • /sys/devices/system/cpu/cpu[x]/topology/ 22

          • /sys/devices/system/cpu/cpu[x]/topology/core_id 22

          • /sys/devices/system/cpu/cpu[x]/topology/core_siblings 22

          • /sys/devices/system/cpu/cpu[x]/topology/physical_package_id 22

          • /sys/devices/system/cpu/cpu[x]/topology/thread_siblings 22

      • /sys/devices/system/cpu/kernel_max 22

      • /sys/devices/system/cpu/offline 22

      • /sys/devices/system/cpu/online 22

      • /sys/devices/system/cpu/possible 22

      • /sys/devices/system/cpu/present 22

    • /sys/devices/system/node/ 22

      • /sys/devices/system/node/node[x]/ 22

        • /sys/devices/system/node/node[x]/cpumap 22

        • /sys/devices/system/node/node[x]/distance 22

        • /sys/devices/system/node/node[x]/hugepages/ 22

          • /sys/devices/system/node/node[x]/hugepages/hugepages-[y]/nr_hugepages 22

        • /sys/devices/system/node/node[x]/meminfo 22
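On Linux, files such as /sys/devices/system/cpu/online and /sys/devices/system/cpu/possible contain a CPU list in range notation, for example `0-3,8`. The sketch below parses that notation. The helper names `parse_cpu_list` and `online_cpus` are hypothetical, and the assumption that Gramine's emulated files use exactly the Linux format is not stated in this document.

```python
def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file (e.g. no offline CPUs) yields an empty set
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read and parse the list of online CPUs from the pseudo-file."""
    with open(path) as f:
        return parse_cpu_list(f.read())
```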


Linux features

Processes

Gramine supports multi-processing. A Gramine instance starts the first (main) process, as specified in the entrypoint of the manifest. The first process can spawn child processes, which belong to the same Gramine instance.

Gramine can execute ELF binaries (executables and libraries) and executable scripts. Gramine supports executing them as entrypoints and via the execve() system call. In case of SGX backend, execve() replaces the calling program with a new program in the same SGX enclave.

Gramine supports creating child processes using the fork(), vfork() and clone() system calls. vfork() is emulated via fork(). When clone() is used to create a child process, the child is always a separate process with its own address space (i.e., CLONE_THREAD, CLONE_FILES, etc. flags cannot be specified); creating threads with clone() is covered in the “Threads” section. In case of SGX backend, child processes are created in a new SGX enclave.
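As an illustration of the process-creation path, here is a minimal fork-and-wait sketch using Python's standard `os` module on Linux. The function name `run_in_child` is hypothetical. Nothing in it is Gramine-specific; under the SGX backend the child would simply run in its own enclave, as described above.

```python
import os


def run_in_child(exit_code):
    """Fork a child that exits with `exit_code`; return the code seen by the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately, skipping the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent: reap the child and decode its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    raise RuntimeError("child did not exit normally")
```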

It is possible to disallow creation of child processes, by specifying sys.disallow_subprocesses = true in the manifest. The intuition is that many applications have fallbacks when they fail to spawn a child process (e.g. Python). This can be useful in SGX environments: child processes consume EPC memory which is a limited resource.
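The fallback behavior mentioned above could look like the following sketch. It assumes that a disallowed subprocess surfaces to a Python application as an `OSError` raised by `os.fork()`; the exact error code Gramine returns is not stated here. `run_task` and its `fork` parameter are hypothetical names introduced for illustration.

```python
import os


def run_task(task, fork=os.fork):
    """Run task() in a child process if possible, otherwise in-process.

    Returns "child" or "inline" to indicate which path was taken.
    """
    try:
        pid = fork()
    except OSError:
        # Assumed failure mode: spawning is not permitted, so fall back.
        task()
        return "inline"
    if pid == 0:
        try:
            task()
        finally:
            os._exit(0)  # never return into the parent's code in the child
    os.waitpid(pid, 0)
    return "child"
```

Passing the fork function as a parameter keeps the fallback path easy to exercise without actually changing the manifest.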

Currently, Gramine does not fully support fork() in multi-threaded applications. There is a known bug: if one thread performs fork() while another thread modifies internal Gramine state, the state may become corrupted, which may lead to failures.

Gramine supports process termination using exit() and exit_group() system calls. If there are child processes executing and the first process exits, Gramine currently does not kill child processes; this is however not a problem in practice because the host OS cleans up these orphaned children.

All aforementioned system calls follow Linux semantics, barring the mentioned peculiarities. However, properties of processes not supported by Gramine (e.g. namespaces, pidfd, etc.) are ignored.

Gramine does not support disassociating parts of the process execution context (via unshare() system call). Gramine does not support comparing two processes (via kcmp()).

Related system calls
  • execve()

  • execveat(): very rarely used by applications

  • clone(): except exotic combination CLONE_VM & !CLONE_THREAD & !CLONE_VFORK

  • fork()

  • vfork(): with the same semantics as fork()

  • exit()

  • exit_group()

  • clone3(): very rarely used by applications

  • unshare(): very rarely used by applications

  • kcmp(): very rarely used by applications

Additional materials
  • The LD_LIBRARY_PATH environment variable is always propagated into new processes; see the issue.


Threads

Gramine implements multi-threading. In case of SGX backend, all threads of one Gramine process run in the same SGX enclave.

Gramine implements per-thread:

  • information about signal (alternate) stack,

  • user/group IDs,

  • thread groups info,

  • signal mask, signal dispositions, signal queue,

  • futex robust list,

  • CPU affinity mask.

Gramine supports creating threads using clone(.. CLONE_VM | CLONE_THREAD ..) system call and destroying threads using exit() system call.
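To illustrate the thread model, the sketch below starts one thread with Python's `threading` module and reports the PID and TID seen from inside it. On Linux, thread libraries typically create threads with a clone-family call using CLONE_VM | CLONE_THREAD, so the new thread shares the PID of its creator but has its own TID. The function name `thread_ids` is hypothetical.

```python
import os
import threading


def thread_ids():
    """Return (pid, tid) as observed from a freshly created thread."""
    result = {}

    def worker():
        result["pid"] = os.getpid()                 # shared by all threads
        result["tid"] = threading.get_native_id()   # unique per thread

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["pid"], result["tid"]
```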

Gramine does not support manipulations of thread-local storage information (via get_thread_area() and set_thread_area() system calls). Instead, Gramine supports setting arch-specific (x86-specific) thread state via arch_prctl(ARCH_GET_FS) and arch_prctl(ARCH_SET_FS). Note that Gramine does not allow arch_prctl(ARCH_GET_GS) and arch_prctl(ARCH_SET_GS) – the GS register is reserved for Gramine internal usage.

Note on thread stack size

Gramine sets the same stack size for each thread. Gramine does not support dynamic growth of the first-thread stack (as Linux does). The stack size in Gramine can be configured via the sys.stack.size manifest option.

Related system calls
  • clone(): must have combination CLONE_VM | CLONE_THREAD

  • exit()

  • get_thread_area(): very rarely used by applications

  • set_thread_area(): very rarely used by applications

  • prctl(): very rarely used by applications

  • arch_prctl(): only x86-specific subset of flags

    • ARCH_GET_FS

    • ARCH_SET_FS

    • ARCH_GET_GS

    • ARCH_SET_GS

  • clone3(): very rarely used by applications


Process and thread identifiers

Gramine supports the following identifiers: Process IDs (PIDs), Parent Process IDs (PPIDs), Thread IDs (TIDs). The corresponding system calls are getpid(), getppid(), gettid(), set_tid_address().

Gramine has dummy support for Process Group IDs (PGIDs): a PGID can only be retrieved or set for the current process. It is impossible to get or set PGIDs of other (e.g. child) processes. The corresponding system calls are getpgid(), getpgrp(), setpgid().

Gramine virtualizes process/thread identifiers. In other words, in-Gramine PIDs and TIDs have no correlation with host-OS PIDs and TIDs. Each Gramine instance starts a main process with PID 1.

Gramine implements a subset of pseudo-files under /proc/[pid]: more pseudo-files for the current process (aka /proc/self) and its threads, fewer pseudo-files for remote processes (e.g. children), and no pseudo-files for remote threads. See the list under “Related pseudo-files”.

Related system calls
  • getpid()

  • getppid()

  • gettid()

  • set_tid_address()

  • getpgid(): dummy, see above

  • setpgid(): dummy, see above

  • getpgrp(): dummy, see above

Related pseudo-files
  • /proc/[this-pid]/ (aka /proc/self/): only most important files implemented

    • /proc/[this-pid]/cmdline

    • /proc/[this-pid]/cwd

    • /proc/[this-pid]/exe

    • /proc/[this-pid]/fd

    • /proc/[this-pid]/maps

    • /proc/[this-pid]/root

    • /proc/[this-pid]/stat: partially implemented

      • pid, comm, ppid, pgrp, num_threads, vsize, rss

      • state: always indicates “R” (running)

      • flags: indicates only PF_RANDOMIZE

      • ☒ remaining fields: always zero

    • /proc/[this-pid]/statm: partially implemented

      • size/VmSize, resident/VmRSS

      • ☒ remaining fields: always zero

    • /proc/[this-pid]/status: partially implemented

      • VmPeak

      • ☒ remaining fields: not printed

    • /proc/[this-pid]/task

  • /proc/[remote-pid]/: minimally implemented

    • /proc/[remote-pid]/cwd

    • /proc/[remote-pid]/exe

    • /proc/[remote-pid]/root

  • /proc/[local-tid]/: same as /proc/[this-pid]

  • /proc/[remote-tid]/: very rarely used by applications

  • /proc/sys/kernel/pid_max
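The fields of /proc/[this-pid]/stat listed above can be extracted as in the following sketch. It follows the field numbering of the Linux proc(5) manual page (pid is field 1, flags field 9, num_threads field 20, vsize field 23, rss field 24). The helper names `parse_stat` and `read_self_stat` are hypothetical.

```python
def parse_stat(raw):
    """Parse the contents of a /proc/[pid]/stat file into a small dict."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split at the last ')' rather than on whitespace.
    head, _, tail = raw.rpartition(")")
    pid_str, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state); field N is rest[N - 3]
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": rest[0],
        "ppid": int(rest[1]),
        "pgrp": int(rest[2]),
        "flags": int(rest[6]),         # field 9
        "num_threads": int(rest[17]),  # field 20
        "vsize": int(rest[20]),        # field 23
        "rss": int(rest[21]),          # field 24
    }


def read_self_stat(path="/proc/self/stat"):
    """Read and parse the stat pseudo-file of the current process."""
    with open(path) as f:
        return parse_stat(f.read())
```

Because the other fields are always zero in Gramine, an application that relies on them (for example, CPU time counters) would see no progress.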


Scheduling

Gramine does not schedule threads itself; instead, it relies on the host OS. In case of SGX backend, trying to perform or control scheduling would be futile because, under the SGX threat model, the enclave has no means to control or verify the scheduling decisions of the host OS.

Gramine fully implements only a few scheduling system calls: sched_yield(), sched_getaffinity(), sched_setaffinity(). Most other scheduling system calls in Gramine have dummy implementations: they return sensible default values and do not send requests to the host OS. Finally, sched_getattr() and sched_setattr() are not implemented in Gramine, as applications very rarely use them. In other words, applications running in Gramine cannot set the scheduling policy or thread priorities, and they cannot learn the policy and priorities currently used by the host OS. See the list under “Related system calls”.

These dummy implementations serve Gramine well. We have not yet encountered applications that would significantly benefit from scheduling system calls being properly implemented in Gramine.

To support CPU affinity masks and expose NUMA/CPU topology, Gramine implements /sys/devices/system/cpu/ and /sys/devices/system/node/ pseudo-files. See the list in the “System information and resource accounting” section.
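A short sketch of the affinity calls that are fully implemented, using Python's `os.sched_getaffinity()` and `os.sched_setaffinity()` on Linux. The function name `pin_to_one_cpu` is hypothetical; the sketch restores the original mask before returning.

```python
import os


def pin_to_one_cpu():
    """Temporarily restrict the calling thread to one of its allowed CPUs.

    Returns the affinity mask that was in effect while pinned.
    """
    original = os.sched_getaffinity(0)  # 0 means "the calling thread"
    target = min(original)
    os.sched_setaffinity(0, {target})
    try:
        return os.sched_getaffinity(0)
    finally:
        os.sched_setaffinity(0, original)
```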

Related system calls
  • sched_yield()

  • sched_getaffinity()

  • sched_setaffinity()

  • getcpu(): dummy, returns a random allowed CPU

  • getpriority(): dummy, returns default value

  • setpriority(): dummy, does nothing

  • sched_getparam(): dummy, returns default values

  • sched_setparam(): dummy, does nothing

  • sched_getscheduler(): dummy, returns default value

  • sched_setscheduler(): dummy, does nothing

  • sched_get_priority_max(): dummy, returns default value

  • sched_get_priority_min(): dummy, returns default value

  • sched_rr_get_interval(): dummy, returns default value

  • sched_getattr(): very rarely used by applications

  • sched_setattr(): very rarely used by applications

  • ioprio_get(): very rarely used by applications

  • ioprio_set(): very rarely used by applications


Memory synchronization (futexes)

Gramine partially implements futexes.

The current implementation is limited to one process, i.e., threads calling the futex() system call on the same futex word must reside in the same process. Gramine does not support non-private futexes; it always assumes the FUTEX_PRIVATE_FLAG flag. We have not yet encountered applications that would require inter-process futexes.

Gramine ignores the FUTEX_CLOCK_REALTIME flag.

Gramine supports the following futex operations: FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_WAKE_OP, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE. Priority-inheritance (PI) futexes and operations on them are not supported.

Gramine implements getting/setting the list of robust futexes, via get_robust_list() and set_robust_list() system calls.
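For readers unfamiliar with the raw futex() interface, the sketch below issues two non-blocking private futex operations through `ctypes`. It is an illustration under several assumptions: the syscall numbers are hard-coded for x86_64 and aarch64 Linux, the wrapper name `futex_call` is hypothetical, and only calls that return immediately are shown (FUTEX_WAKE with no waiters, and FUTEX_WAIT with a mismatching expected value).

```python
import ctypes
import errno
import platform

# Assumption: futex() syscall numbers for two common 64-bit Linux ABIs.
SYS_FUTEX = {"x86_64": 202, "aarch64": 98}.get(platform.machine())

FUTEX_WAIT = 0
FUTEX_WAKE = 1
FUTEX_PRIVATE_FLAG = 128

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def futex_call(word, op, val):
    """Invoke a process-private futex operation on a ctypes.c_uint32 word.

    Returns the syscall result, or a negative errno value on failure.
    """
    if SYS_FUTEX is None:
        raise NotImplementedError("futex syscall number unknown for this machine")
    ret = _libc.syscall(
        ctypes.c_long(SYS_FUTEX),
        ctypes.byref(word),
        ctypes.c_int(op | FUTEX_PRIVATE_FLAG),
        ctypes.c_uint32(val),
        None,                # timeout: unused by the calls shown here
        None,                # uaddr2: unused
        ctypes.c_uint32(0),  # val3: unused
    )
    return ret if ret != -1 else -ctypes.get_errno()
```

With a word holding 1, `futex_call(word, FUTEX_WAKE, 1)` wakes no one, and `futex_call(word, FUTEX_WAIT, 2)` fails with EAGAIN instead of sleeping because the expected value does not match.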

Related system calls
  • futex(): see notes above

  • get_robust_list()

  • set_robust_list()

  • futex_waitv(): very rarely used by applications


Memory management

Gramine implements memory-management system calls: mmap(), mprotect(), munmap(), brk(), etc. Some exotic flags and features are not implemented, but we didn’t observe any applications that would fail or behave incorrectly because of that.

mmap() supports anonymous (MAP_ANONYMOUS) and file-backed (MAP_FILE) mappings. All commonly used flags like MAP_SHARED, MAP_PRIVATE, MAP_FIXED, MAP_FIXED_NOREPLACE, MAP_STACK, MAP_GROWSDOWN, MAP_32BIT are supported.

In case of SGX backend, MAP_SHARED flag is ignored for anonymous mappings, and for file-backed mappings, it depends on the type of file:

  • disallowed for trusted files (these files are read-only, thus the flag is meaningless),

  • disallowed for allowed files (for security reasons: it would be easy to abuse it),

  • allowed for encrypted files (but synchronization happens only on explicit system calls like msync() and close()).
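Because synchronization of shared file-backed mappings happens only on explicit calls, an application that wants its writes to reach the file should flush before closing. The sketch below does this with Python's `mmap` module, whose `flush()` method issues msync(). The function name `write_through_shared_mapping` is hypothetical, and the example says nothing about how a file is configured (trusted, allowed or encrypted) in the manifest.

```python
import mmap
import os
import tempfile


def write_through_shared_mapping(path, payload):
    """Write a non-empty `payload` to `path` via a MAP_SHARED mapping."""
    with open(path, "r+b") as f:
        f.truncate(len(payload))  # the file must be as large as the mapping
        with mmap.mmap(
            f.fileno(),
            len(payload),
            flags=mmap.MAP_SHARED,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        ) as m:
            m[: len(payload)] = payload
            m.flush()  # explicit msync() before the mapping is closed
```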

MAP_NORESERVE’s original semantics are not implemented and it is silently ignored. However, in case of SGX backend and on systems supporting EDMM, MAP_NORESERVE flag is used as a lazy-allocation heuristic/hint for anonymous mappings – instead of pre-accepting the region of enclave pages on mmap requests, the enclave pages are lazily accepted on page-fault events.

MAP_LOCKED, MAP_POPULATE, MAP_NONBLOCK, MAP_HUGETLB, MAP_HUGE_2MB, MAP_HUGE_1GB flags are ignored (allowed but have no effect). MAP_SYNC flag is not supported.

mprotect() supports all flags except PROT_SEM and PROT_GROWSUP. We haven’t encountered any applications that would use these flags. In case of SGX backend, mprotect() behavior differs:

  • on systems supporting EDMM, mprotect() correctly applies permissions;

  • on systems not supporting EDMM, all enclave memory is allocated with Read-Write-Execute permissions, and mprotect() calls are silently ignored.

madvise() implements only a minimal subset of functionality:

  • MADV_DONTNEED is partially supported:

    • resetting writable file-backed mappings is not implemented;

    • all other cases are implemented.

  • MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL, MADV_WILLNEED, MADV_FREE, MADV_SOFT_OFFLINE, MADV_MERGEABLE, MADV_UNMERGEABLE, MADV_HUGEPAGE, MADV_NOHUGEPAGE are ignored (allowed but have no effect).

  • All other advice values are not supported.
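The most common use of MADV_DONTNEED, discarding the contents of a private anonymous mapping, can be sketched as follows with Python's `mmap.madvise()` (Python 3.8 or later, Linux). On Linux, the pages read back as zeros afterwards. The function name `dontneed_resets_anonymous_memory` is hypothetical.

```python
import mmap


def dontneed_resets_anonymous_memory(size=mmap.PAGESIZE):
    """Write to an anonymous private mapping, discard it, and read it back."""
    m = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        m[:4] = b"data"
        m.madvise(mmap.MADV_DONTNEED)  # drop the pages; next access is zero-filled
        return m[:4]
    finally:
        m.close()
```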

Gramine does not support anonymous files (created via memfd_create()).

Quick summary of other memory-management system calls:

  • munmap() has nothing of note;

  • mremap() is not implemented (very rarely used by applications);

  • msync() implements only MS_SYNC and MS_ASYNC (MS_INVALIDATE is not implemented);

  • mbind() is a no-op;

  • mincore() always tells that pages are not in RAM;

  • set_mempolicy() and get_mempolicy() are not implemented;

  • mlock(), munlock(), mlockall(), munlockall(), mlock2() are dummy (always return success).

As can be seen from above, many performance-improving system calls, flags and features are currently not implemented by Gramine. Keep this in mind when you observe application performance degradation.

Related system calls
  • brk()

  • mmap(): see above for notes

  • mprotect(): see above for notes

  • munmap()

  • msync(): does not implement MS_INVALIDATE

  • madvise(): see above for notes

  • mbind(): dummy

  • mincore(): dummy

  • mlock(): dummy

  • munlock(): dummy

  • mlockall(): dummy

  • munlockall(): dummy

  • mlock2(): dummy

  • mremap(): very rarely used by applications

  • remap_file_pages(): very rarely used by applications

  • set_mempolicy(): may be implemented in the future

  • get_mempolicy(): may be implemented in the future

  • memfd_create(): may be implemented in the future

  • memfd_secret(): very rarely used by applications

  • membarrier(): may be implemented in the future

  • move_pages(): very rarely used by applications

  • migrate_pages(): very rarely used by applications

  • process_madvise(): very rarely used by applications

  • process_mrelease(): very rarely used by applications

  • set_mempolicy_home_node(): very rarely used by applications


Overview of Inter-Process Communication (IPC)

Gramine implements most of the Linux IPC mechanisms. In particular:

  • ☑ Signals and process state changes

  • ☑ Pipes

  • ☑ FIFOs (named pipes)

  • ▣ UNIX domain sockets

  • ▣ File locking

  • ▣ Shared memory (untrusted, POSIX only)

  • ▣ Semaphores (untrusted, POSIX only)

  • ☒ Message queues

Gramine implements pipes, FIFOs and UNIX domain sockets (UDSes) via host-OS pipes. In case of SGX backend, all pipe, FIFO and UDS communication is transparently encrypted.

For all other IPC mechanisms – currently these are signals, process state changes, file locks – Gramine emulates them via internal message passing (in case of SGX, all messages are encrypted).

Thus, Gramine implements all IPC primitives using a single host-OS primitive: pipes. This design choice means that Gramine is a distributed Library OS, in contrast to the monolithic Linux kernel. Each Gramine process knows only its own state and must query peer Gramine processes to learn theirs, whereas the Linux kernel keeps a single state for all processes running on top of it. To govern this message passing, the first Gramine process is designated the leader, which controls all message requests and responses among processes in one Gramine instance. For example, when a Gramine process spawns a new child, it asks the leader to assign a PID to this child. As another example, all POSIX-locking operations are synchronized using a special messaging protocol that is managed by the leader.
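The leader idea can be pictured with a toy model. The sketch below is not Gramine's protocol: the one-byte request and four-byte reply formats, the opcode values and all function names are invented for illustration, and the leader runs as a thread rather than as a separate process. It only shows PID allocation being serialized through messages over pipes.

```python
import os
import struct
import threading

REQ = struct.Struct("<B")   # hypothetical 1-byte opcode
RESP = struct.Struct("<I")  # hypothetical 4-byte PID reply
OP_STOP, OP_ALLOC_PID = 0, 1


def leader_loop(req_fd, resp_fd, first_pid=2):
    """Serve PID-allocation requests until OP_STOP (or EOF) arrives."""
    next_pid = first_pid  # PID 1 belongs to the main process
    while True:
        msg = os.read(req_fd, REQ.size)
        if not msg or REQ.unpack(msg)[0] == OP_STOP:
            break
        os.write(resp_fd, RESP.pack(next_pid))
        next_pid += 1


def request_pid(req_fd, resp_fd):
    """Ask the leader for a fresh PID and wait for its reply."""
    os.write(req_fd, REQ.pack(OP_ALLOC_PID))
    return RESP.unpack(os.read(resp_fd, RESP.size))[0]


def demo():
    """Allocate three PIDs through the toy leader and return them."""
    req_r, req_w = os.pipe()
    resp_r, resp_w = os.pipe()
    leader = threading.Thread(target=leader_loop, args=(req_r, resp_w))
    leader.start()
    try:
        return [request_pid(req_w, resp_r) for _ in range(3)]
    finally:
        os.write(req_w, REQ.pack(OP_STOP))
        leader.join()
        for fd in (req_r, req_w, resp_r, resp_w):
            os.close(fd)
```

Every allocation costs a round trip over the pipes, which hints at why IPC-heavy workloads can be slower in such a design.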

Because of this Gramine peculiarity, IPC-intensive applications may experience performance degradation. Also, some IPC-related system calls and pseudo-files are not implemented in Gramine due to the complexity of message-passing implementation.

Gramine implements limited support for POSIX shared memory (but not for System V shared memory). Please note that in case of the SGX backend, implementation of shared memory is insecure. For more information, please refer to the corresponding manifest syntax and the corresponding section in this document.

Because Gramine supports POSIX shared memory, it also supports POSIX semaphores (which are built on top of POSIX shared memory). In case of the SGX backend, implementation of POSIX semaphores is insecure, similarly to POSIX shared memory. Please refer to the corresponding section in this document.

To learn more about Gramine support for each of the Linux IPC mechanisms, refer to corresponding sections below.

Additional materials
  • For Linux IPC overview, we recommend reading Beej’s Guide to Unix IPC.

  • In case of SGX backend, pipes, FIFOs, UDSes and all other IPC communication are encrypted using the TLS-PSK (TLS with Pre-Shared Keys) protocol. The pre-shared key is randomly generated for each new Gramine instance. Before establishing any pipe/IPC communication, two Gramine processes (e.g., parent and child) verify each other’s trustworthiness using SGX local attestation.


Signals and process state changes

Gramine partially implements signals (see below for some limitations). For local signals (Gramine process signals itself, e.g. SIGABRT) and signals from the host OS (e.g. host sends SIGTERM), message passing is not involved. For process-to-process signals (e.g. child process sends SIGCHLD to the parent), message passing is used.

Gramine supports both standard signals and POSIX real-time signals. Queueing and delivery semantics are the same as in Linux. Per-thread signal masks are supported. Restart of system calls after signal handling (if flag SA_RESTART was specified) is supported.
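As a minimal sketch of the per-thread signal mask semantics mentioned above, the following Python snippet blocks SIGUSR1, sends it to the current thread, observes it as pending, and then unblocks it so the handler runs. It uses only standard-library wrappers (which map to rt_sigaction(), rt_sigprocmask(), rt_sigpending() and tgkill()); the variable names are illustrative, and the snippet assumes it runs in the main thread.

```python
import signal
import threading

received = []


def on_usr1(signum, frame):
    # Record every delivery so it can be inspected afterwards.
    received.append(signum)


signal.signal(signal.SIGUSR1, on_usr1)

# Block SIGUSR1 in this thread, then send it to this very thread.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)

# While blocked, the signal is queued as pending and the handler has not run.
pending_while_blocked = signal.SIGUSR1 in signal.sigpending()
handled_while_blocked = bool(received)

# Unblocking delivers the pending signal; the handler runs before this returns.
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
```

Because this is a local signal (the process signals itself), no message passing between Gramine processes would be involved according to the description above.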

Gramine implements signal dispositions, but some rarely used features are not implemented:

  • core dump files are never produced,

  • SA_NOCLDSTOP and SA_NOCLDWAIT signal-behavior flags are ignored,

  • only fields si_signo, si_code, si_pid, si_uid, si_status, si_addr in the data type siginfo_t are populated.

Gramine supports injecting a single SIGTERM signal from the host. No other signals from the host are supported. By default, Gramine ignores all signals sent by the host (including signals sent from other applications or from other Gramine instances). This limitation is for security reasons, relevant on SGX backend.

Gramine has some limitations on sending signals to processes and threads:

  • sending a signal to a process group is not supported (e.g. kill(0) sends the signal only to the current process but not to other processes),

  • tkill() system call cannot send signals to threads in other processes.

Gramine supports waiting for signals (via pause(), rt_sigsuspend(), etc. system calls).

Gramine supports waiting for processes via wait4() and waitid() system calls. However, WSTOPPED and WCONTINUED options are not supported (we didn’t encounter applications that rely on these options). Zombie processes are supported, though the “zombie” state is not reported in /proc/[pid]/stat pseudo-file.
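The following sketch shows a process-to-process signal combined with wait4(): a parent forks a child, terminates it with a signal, and inspects the wait status. It is a plain Linux example using Python's standard library, not Gramine-specific code; the helper name is illustrative. It deliberately uses no wait options, since WSTOPPED and WCONTINUED are listed above as unsupported.

```python
import os
import signal


def spawn_and_kill():
    pid = os.fork()
    if pid == 0:
        # Child: sleep until a signal arrives; never returns normally.
        while True:
            signal.pause()
    # Parent: send a signal to the child, then reap it (no zombie is left behind).
    os.kill(pid, signal.SIGKILL)
    waited_pid, status, _rusage = os.wait4(pid, 0)
    return pid, waited_pid, status


child_pid, waited_pid, status = spawn_and_kill()
terminated_by = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
```

Under Gramine, both the kill() to the child and the resulting state change reported to the parent would travel via the message passing described in the IPC overview.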

Gramine does not currently support file descriptors for signals (via signalfd()). Also, since Gramine does not currently support pidfd, sending a signal via pidfd_send_signal() is not implemented. Gramine also does not support file descriptors for handling page faults (via userfaultfd()).

Gramine has limited support for pseudo-files that describe the state of remote processes/threads (files under /proc/[remote-pid]/ and /proc/[remote-tid]/). For details, refer to “Related pseudo-files” in the “Process and thread identifiers” section.

Related system calls
  • pause()

  • rt_sigaction()

  • rt_sigpending()

  • rt_sigprocmask()

  • rt_sigreturn()

  • rt_sigsuspend()

  • rt_sigtimedwait()

  • sigaltstack()

  • rt_sigqueueinfo(): very rarely used by applications

  • rt_tgsigqueueinfo(): very rarely used by applications

  • signalfd(): very rarely used by applications

  • signalfd4(): very rarely used by applications

  • pidfd_open(): very rarely used by applications

  • pidfd_getfd(): very rarely used by applications

  • pidfd_send_signal(): very rarely used by applications

  • process_madvise(): very rarely used by applications

  • process_mrelease(): very rarely used by applications

  • userfaultfd(): very rarely used by applications

  • kill(): process groups not supported

  • tkill(): remote threads not supported

  • tgkill()

  • wait4(): WSTOPPED and WCONTINUED not supported

  • waitid(): WSTOPPED and WCONTINUED not supported


User and group identifiers

Gramine has dummy support for the following identifiers:

  • Real user ID (UID) and Real group ID (GID),

  • Effective user ID (EUID) and Effective group ID (EGID),

  • Saved set-user-ID (SUID) and Saved set-group-ID (SGID).

The corresponding system calls are:

  • getuid(), getgid(), setuid(), setgid() for UID and GID (implemented);

  • geteuid(), getegid() for EUID and EGID (implemented);

  • setreuid(), setregid() for UID + EUID and GID + EGID (not implemented);

  • getresuid(), setresuid(), getresgid(), setresgid() for UID + EUID + SUID and GID + EGID + SGID (not implemented).

Gramine starts the application with UID = EUID = SUID, all equal to the loader.uid manifest option. Similarly, the application is started with GID = EGID = SGID, all equal to loader.gid. If these manifest options are not set, then all IDs are equal to zero, which means the root user.

During execution, the application may modify these IDs, and the changes will be visible inside the Gramine environment.

Gramine does not support the filesystem user ID (FSUID) and filesystem group ID (FSGID). The corresponding system calls are setfsuid() and setfsgid() (not implemented).

Gramine has dummy support for supplementary group IDs. The corresponding system calls are getgroups() and setgroups(). Gramine starts the application with an empty set of supplementary groups. The application may modify this set, and the changes will be visible inside the Gramine environment.
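A quick way to see which identities an application observes is to query them with the system calls listed above as implemented. This is a minimal sketch; the dictionary layout is illustrative. On a native Linux host the values reflect the real user, whereas under Gramine they would be expected to follow loader.uid and loader.gid (zero when unset) with an empty supplementary group set, per the description above.

```python
import os

# Only calls described above as implemented are used; getresuid() and
# getresgid() are avoided because they are listed as not implemented.
ids = {
    "uid": os.getuid(),
    "euid": os.geteuid(),
    "gid": os.getgid(),
    "egid": os.getegid(),
    "groups": os.getgroups(),
}

# True when the application believes it runs as root.
runs_as_root = ids["uid"] == 0 and ids["euid"] == 0
```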

Currently, there are only two usages of user/group IDs in Gramine:

  • changing ownership of a file via chown() and similar system calls;

  • passing user ID in the SIGCHLD signal information on child process termination (in siginfo_t::si_uid).

Gramine does not currently implement user/group ID fields in the /proc/[pid]/status pseudo-file.

Related system calls
  • getuid(): dummy

  • getgid(): dummy

  • setuid(): dummy

  • setgid(): dummy

  • geteuid(): dummy

  • getegid(): dummy

  • getgroups(): dummy

  • setgroups(): dummy

  • setreuid(): very rarely used by applications, may be implemented in the future

  • setregid(): very rarely used by applications, may be implemented in the future

  • getresuid(): very rarely used by applications, may be implemented in the future

  • setresuid(): very rarely used by applications, may be implemented in the future

  • getresgid(): very rarely used by applications, may be implemented in the future

  • setresgid(): very rarely used by applications, may be implemented in the future

  • setfsuid(): very rarely used by applications

  • setfsgid(): very rarely used by applications

Related pseudo-files
  • /proc/[this-pid]/status: fields Uid, Gid, Groups are not implemented


File systems

Gramine implements filesystem operations, but with several peculiarities and limitations.

The most important peculiarity is that Gramine does not simply mirror the host OS’s directory hierarchy. Instead, Gramine constructs its own view of a selected subset of the host’s directories and files: this is controlled by the manifest’s FS mount points (fs.mounts). This feature is similar to the volumes concept in Docker. It was introduced for security.

Another peculiarity is that Gramine provides several types of filesystem mounts:

  • passthrough mounts (contain unencrypted files, see below),

  • encrypted mounts (contain files that are automatically encrypted and integrity-protected).

In case of SGX backend, passthrough mounts must be of one of two kinds:

  • containing allowed files (not encrypted or cryptographically hashed),

  • containing trusted files (cryptographically hashed – effectively, their contents are mixed into MRENCLAVE on SGX).

Additionally, mounts may be hosted in one of two ways:

  • on the host OS (in passthrough mounts),

  • inside the Gramine process (in tmpfs mounts).

All files potentially used by the application must be specified in the manifest file. Instead of single files, whole directories can be specified. Refer to the manifest documentation for more details.

Gramine also provides a subset of pseudo-files that can be found in a Linux kernel. In particular, Gramine automatically populates /proc, /dev and /sys pseudo-filesystems with most widely used pseudo-files. These pseudo-files cannot be deleted. The complete list can be found in the “List of pseudo-files” section.

The final peculiarity is that Gramine is a distributed Library OS, as discussed in “Overview of Inter-Process Communication (IPC)” section. This means that each Gramine process knows only about its own FS state at any point in time, and must consult the host OS and/or other Gramine processes to learn about any updates. Synchronizing the FS state is a difficult task, and Gramine has only limited support for file sync. For example, two Gramine processes may want to append data to the same file, but Gramine currently does not synchronize such accesses, thus the file contents will be incorrectly overwritten.

Internally, FS implementation in Gramine follows the one in Linux kernel. Gramine implements a Virtual File System (VFS), a uniform interface for various mount types. Gramine also has the concepts of dentries (cached directory/file names for fast lookup) and inodes (metadata about files).

By design, Gramine does not implement a full filesystem stack; it relies on the host filesystem for most operations. The only exceptions are the tmpfs filesystem and the pseudo-filesystems (implemented entirely inside Gramine).

General FS limitations in Gramine include:

  • no support for dynamic mounting: all mounts must be specified beforehand in the manifest;

  • no operations across mounts, e.g., no rename of file located in one mount to another one (note that Linux also doesn’t support such operations);

  • no synchronization of file offsets, file sizes, etc. between Gramine processes;

  • tmpfs mounts (in-memory filesystems) are not shared by Gramine processes;

  • file timestamps (access, modified, change timestamps) are not set/updated.

Additional materials

A mechanism for FS synchronization, as well as a general redesign of certain FS components, is a task Gramine will tackle in the future. Below are some discussions and RFCs:


File system operations

Gramine implements all classic filesystem operations, but with limitations described below.

Gramine supports opening files and directories (via open() and openat() system calls). O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOFOLLOW, O_PATH, O_TRUNC flags are supported. Other flags are ignored. Notable ignored flags are O_APPEND (not yet implemented in Gramine) and O_TMPFILE (bug in Gramine: should not be silently ignored).
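As an illustration of the supported open flags, the sketch below uses O_CREAT together with O_EXCL to create a file only if it does not exist yet. The helper name and file name are hypothetical; the flags are the ones listed above as supported.

```python
import os
import tempfile


def create_exclusive(path):
    """Return True if this call created the file, False if it already existed."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_CLOEXEC, 0o600)
    except FileExistsError:
        return False
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "state.lock")
    first = create_exclusive(target)   # file is created
    second = create_exclusive(target)  # file already exists, open fails
```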

Trusted files can be opened only for reading. Already-existing encrypted files can be opened only if they were not moved or renamed on the host (this is for protection against file renaming attacks).

Gramine supports creating files and directories (via creat(), mkdir(), mkdirat() system calls), reading directories (via getdents()), deleting files and directories (via unlink(), unlinkat(), rmdir()), renaming files and directories (via rename() and renameat()).

Gramine supports read and write operations on files. Appending to files is currently unsupported. Writing to trusted files is prohibited.

Gramine supports seek operations on files (lseek()). However, seek operation happens entirely inside Gramine (by changing the file offset), and thus may behave incorrectly on host’s device files (which may reimplement the seek operation in a special way).
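Since O_APPEND is ignored, an application that needs append-like behavior could seek to the end of the file explicitly before each write. The following is one possible workaround sketch, not a recommendation from the Gramine project; append_bytes is a hypothetical helper. It is not atomic and offers no protection against concurrent writers, which matches the lack of file synchronization between Gramine processes described earlier.

```python
import os
import tempfile


def append_bytes(fd, data):
    """Write data at the current end of file without relying on O_APPEND."""
    os.lseek(fd, 0, os.SEEK_END)
    view = memoryview(data)
    while view:
        # os.write may write fewer bytes than requested; loop until done.
        written = os.write(fd, view)
        view = view[written:]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "log.txt")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        append_bytes(fd, b"hello ")
        os.lseek(fd, 0, os.SEEK_SET)  # move the offset away from the end
        append_bytes(fd, b"world")    # still lands at the end of the file
        os.lseek(fd, 0, os.SEEK_SET)
        contents = os.read(fd, 64)
    finally:
        os.close(fd)
```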

Gramine supports mmap and msync operations on files. For more information, see the “Memory management” section.

Gramine has dummy support for polling on files via poll(), ppoll(), select() system calls. Regular files always return events “there is data to read” and “writing is possible”. Other files return an error code.
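The always-ready behavior for regular files can be observed with a short sketch. On native Linux, regular files also report as readable and writable immediately, so the snippet below behaves the same way outside Gramine; variable names are illustrative.

```python
import select
import tempfile

with tempfile.TemporaryFile() as f:
    poller = select.poll()
    poller.register(f.fileno(), select.POLLIN | select.POLLOUT)
    # Zero timeout: return immediately with whatever is ready.
    events = dict(poller.poll(0))
    revents = events.get(f.fileno(), 0)

readable = bool(revents & select.POLLIN)
writable = bool(revents & select.POLLOUT)
```

In other words, polling a regular file never blocks, so it cannot be used to wait for another process to append data.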

Gramine does not support epoll on files.

Gramine supports file flushes (via fsync() and fdatasync()). However, flushing filesystem metadata (sync() and syncfs()) is not supported. Similarly, sync_file_range() system call is currently not supported.

Gramine supports file truncation (via truncate() and ftruncate()).

Gramine has very limited support of fallocate() system call. Only mode 0 is supported (“allocating disk space”). The emulation of this mode simply extends the file size if applicable, otherwise does nothing. In other words, this system call doesn’t provide reliability or performance guarantees.

Gramine has dummy support of fadvise64() system call. The emulation does nothing and always returns success. In other words, this system call doesn’t provide any performance improvement.

Gramine has support for file mode bits. The chmod(), fchmodat(), fchmod() system calls correctly set the file mode. The umask() system call is also supported.

Gramine has dummy support for file owner and group manipulations. In Gramine, users and groups are dummy; see the “User and group identifiers” section for details. Therefore, chown(), fchownat(), fchown() system calls update UID and GID inside the Gramine environment, but not on host files.

Gramine supports checking permissions on the file via access() and faccessat() system calls. Recall however that users and groups are dummy in Gramine, thus the checks are also largely irrelevant.

Gramine implements sendfile() system call. However, this system call is emulated in an inefficient way (for simplicity), especially in multi-threaded cases. Pay attention to this if your application relies heavily on sendfile().

Gramine supports directory operations: chdir() and fchdir() to change the working directory, and getcwd() to get the current working directory.

Gramine partially supports getting file status (information about files), via stat(), lstat(), fstat(), newfstatat() system calls. The only fields populated in the output buffer are st_mode, st_size, st_uid, st_gid, st_blksize (with hard-coded value), st_nlink (with hard-coded value), st_dev, st_ino. Note that Gramine currently doesn’t support links, so lstat() always resolves to a file (never to a symlink).
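A portable application can restrict itself to the stat fields listed above as populated. The sketch below sets the mode bits with chmod() and then reads back only st_mode and st_size; it intentionally ignores timestamps, which are described as not set or updated. File and variable names are illustrative.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "data.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 123)
    os.chmod(path, 0o640)
    info = os.stat(path)

summary = {
    "is_regular": stat.S_ISREG(info.st_mode),  # file type from st_mode
    "mode": stat.S_IMODE(info.st_mode),        # permission bits from st_mode
    "size": info.st_size,
}
```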

Gramine has dummy support for getting filesystem statistics via statfs() and fstatfs(). The only fields populated in the output buffer are f_bsize, f_blocks, f_bfree and f_bavail, and they all have hard-coded values.

Gramine currently does not support changing file access/modification times, via utime(), utimes(), futimesat(), utimensat() system calls.

Mounting files and directories with extended attributes (xattr) or setting them via setxattr(), lsetxattr(), fsetxattr(), removexattr(), lremovexattr(), fremovexattr() is not supported. Reading is supported (getxattr(), lgetxattr(), fgetxattr(), listxattr(), llistxattr(), flistxattr()) but always returns no attributes (which is a correct result in our case).

Related system calls
  • open(): implemented, with limitations

  • openat(): implemented, with limitations

  • close()

  • close_range()

  • creat()

  • mkdir()

  • mkdirat()

  • getdents()

  • getdents64()

  • unlink()

  • unlinkat()

  • rmdir()

  • rename(): cannot rename across mounts

  • renameat(): cannot rename across mounts

  • read()

  • pread64()

  • readv()

  • preadv()

  • write()

  • pwrite64()

  • writev()

  • pwritev()

  • lseek(): see note above

  • mmap(): see notes above

  • msync(): see notes above

  • select(): dummy

  • pselect6(): dummy

  • poll(): dummy

  • ppoll(): dummy

  • fsync()

  • fdatasync()

  • truncate()

  • ftruncate()

  • fallocate(): dummy

  • fadvise64(): dummy

  • chmod()

  • fchmod()

  • fchmodat()

  • chown(): dummy

  • fchown(): dummy

  • fchownat(): dummy

  • access(): dummy

  • faccessat(): dummy

  • umask()

  • sendfile(): unoptimized

  • chdir()

  • fchdir()

  • getcwd()

  • stat(): partially dummy

  • fstat(): partially dummy

  • lstat(): partially dummy, always resolves to actual file

  • newfstatat(): partially dummy

  • statfs(): partially dummy

  • fstatfs(): partially dummy

  • chroot()

  • name_to_handle_at(): very rarely used by applications

  • open_by_handle_at(): very rarely used by applications

  • openat2(): very rarely used by applications

  • renameat2(): very rarely used by applications

  • preadv2(): very rarely used by applications

  • pwritev2(): very rarely used by applications

  • epoll_create(): very rarely used by applications

  • epoll_create1(): very rarely used by applications

  • epoll_wait(): very rarely used by applications

  • epoll_pwait(): very rarely used by applications

  • epoll_pwait2(): very rarely used by applications

  • epoll_ctl(): very rarely used by applications

  • sync(): very rarely used by applications

  • syncfs(): very rarely used by applications

  • sync_file_range(): very rarely used by applications

  • faccessat2(): very rarely used by applications

  • statx(): very rarely used by applications

  • sysfs(): very rarely used by applications

  • ustat(): very rarely used by applications

  • mount(): very rarely used by applications

  • move_mount(): very rarely used by applications

  • umount2(): very rarely used by applications

  • mount_setattr(): very rarely used by applications

  • pivot_root(): very rarely used by applications

  • utime(): may be implemented in the future

  • utimes(): may be implemented in the future

  • futimesat(): may be implemented in the future

  • utimensat(): may be implemented in the future

  • getxattr()

  • lgetxattr()

  • fgetxattr()

  • listxattr()

  • llistxattr()

  • flistxattr()

  • removexattr()

  • lremovexattr()

  • fremovexattr()

  • setxattr()

  • lsetxattr()

  • fsetxattr()


File locking

File locking operations can be considered one of the IPC mechanisms, as discussed in “Overview of Inter-Process Communication (IPC)” section. Thus, file locks are implemented via message passing in Gramine, and all lock-requests are handled in the main (leader) process.

Gramine currently implements two types of file locks:

  • POSIX (fcntl) locks aka Advisory record locks. In particular, the following operations are implemented: fcntl(F_SETLK), fcntl(F_SETLKW) and fcntl(F_GETLK).

  • BSD (flock) locks. The following system call is implemented: flock(). Its support is currently experimental and not suitable for production.

Both types of file locks share the same internal implementation in Gramine. The current implementation has the following caveats:

  • Lock requests from other processes will always have the overhead of IPC round-trip, even if the lock is uncontested.

  • The main process has to be able to look up the same file, so locking will not work for files in local-process-only filesystems (e.g. tmpfs).

  • There is no deadlock detection (EDEADLK). This is only applicable to POSIX locks; BSD locks do not have deadlock detection in the first place.

  • The lock requests cannot be interrupted (EINTR).

  • The locks work only on regular files (no pipes, sockets etc.).

Similarly to Linux, BSD (flock) locks ignore deprecated LOCK_{MAND,READ,WRITE,RW} operations.

BSD (flock) locks are currently experimental and are disabled by default. To enable them, use the sys.experimental__enable_flock manifest option. There is at least one problem with BSD locks currently: they are supposed to be released when the last reference (file descriptor, or FD) to the underlying opened file is closed, including when a process with the opened file terminates. Unfortunately, Gramine lacks system-wide tracking of opened files’ FDs. This may lead to premature releases of flock locks in some situations.
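The POSIX (fcntl) locking operations can be exercised with the sketch below: the parent takes an exclusive record lock, and a forked child probes the same byte range with a non-blocking request. Python's fcntl.lockf() is a wrapper around fcntl(F_SETLK) and fcntl(F_SETLKW); the helper and file names are hypothetical. Under Gramine, the file would need to live on a mount that the leader process can also look up (not tmpfs), per the caveats above.

```python
import fcntl
import os
import tempfile


def child_lock_conflicts(path):
    """Fork a child that tries a non-blocking exclusive lock on bytes 0..15."""
    pid = os.fork()
    if pid == 0:
        code = 2  # unexpected failure
        try:
            fd = os.open(path, os.O_RDWR)
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 16, 0)
                code = 0  # lock acquired
            except OSError:
                code = 1  # lock is held by another process
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 1


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "records.db")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX, 16, 0)
        conflict_while_held = child_lock_conflicts(path)
        fcntl.lockf(fd, fcntl.LOCK_UN, 16, 0)
        conflict_after_release = child_lock_conflicts(path)
    finally:
        os.close(fd)
```

Note that each probe from the child would incur an IPC round trip to the leader in Gramine, even when the lock is uncontested.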

Related system calls
  • fcntl()

    • F_SETLK: see notes above

    • F_SETLKW: see notes above

    • F_GETLK: see notes above

  • flock(): experimental, see notes above


Monitoring filesystem events (inotify, fanotify)

Gramine does not currently implement inotify and fanotify APIs. Gramine could implement them in the future, if need arises.

Related system calls
  • inotify_init()

  • inotify_init1()

  • inotify_add_watch()

  • inotify_rm_watch()

  • fanotify_init()

  • fanotify_mark()


Pipes and FIFOs (named pipes)

Pipes and FIFOs are emulated in Gramine directly as host-level pipes (to be more specific, as socketpairs for Linux hosts). In case of SGX backend, pipes and FIFOs are transparently encrypted. For additional information on general properties of IPC in Gramine, see the “Overview of Inter-Process Communication (IPC)” section.

Gramine does not allow pipe/FIFO communication between Gramine processes and the host. Gramine also does not allow communication between Gramine processes from two different Gramine instances. Communication on pipes/FIFOs is possible only between two Gramine processes in the same Gramine instance.

Gramine does not allow more than two parties on one pipe/FIFO. For example, it is impossible to implement an SPMC (Single Producer Multiple Consumers) queue using a single pipe/FIFO. (We have not encountered applications that would try to use such patterns though.)

Gramine supports creating pipes (via pipe() and pipe2()) and FIFOs (via mknod(S_IFIFO) and mknodat(S_IFIFO)). The O_DIRECT flag while creating pipes with pipe2() is ignored. Blocking and non-blocking pipes/FIFOs (O_NONBLOCK flag) are supported.

Gramine supports read and write operations on pipes and FIFOs. Gramine supports generation of the SIGPIPE signal on write operation if the read end of a pipe has been closed. Polling on pipes and FIFOs is supported.

Gramine supports getting information about pipes/FIFOs via the fstat() and newfstatat() system calls. The only fields populated in the output buffer are st_uid, st_gid and st_mode. Gramine also supports getting the number of unread bytes in the pipe via ioctl(FIONREAD).

Gramine supports getting and setting pipe/FIFO status flags via fcntl(F_GETFL) and fcntl(F_SETFL). The only currently supported flag is O_NONBLOCK; O_ASYNC is not supported. Gramine also supports setting blocking/non-blocking mode via ioctl(FIONBIO).
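The status-flag and ioctl operations above can be combined in a short sketch: switch the read end of a pipe to non-blocking mode via fcntl(F_SETFL), confirm that reading an empty pipe does not block, and query the number of unread bytes via ioctl(FIONREAD). Variable names are illustrative.

```python
import fcntl
import os
import struct
import termios

read_fd, write_fd = os.pipe()

# Switch the read end to non-blocking mode.
flags = fcntl.fcntl(read_fd, fcntl.F_GETFL)
fcntl.fcntl(read_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# Reading an empty non-blocking pipe fails with EAGAIN instead of blocking.
try:
    os.read(read_fd, 1)
    empty_read_blocked = False
except BlockingIOError:
    empty_read_blocked = True

os.write(write_fd, b"hello")

# FIONREAD stores the number of unread bytes into a C int.
raw = fcntl.ioctl(read_fd, termios.FIONREAD, struct.pack("i", 0))
unread = struct.unpack("i", raw)[0]
payload = os.read(read_fd, unread)

os.close(read_fd)
os.close(write_fd)
```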

Related system calls
  • pipe()

  • pipe2(): O_DIRECT flag is ignored

  • mknod(): S_IFIFO type is supported

  • mknodat(): S_IFIFO type is supported

  • close()

  • fstat()

  • read()

  • readv()

  • write()

  • writev()

  • select()

  • pselect6()

  • poll()

  • ppoll()

  • epoll_create()

  • epoll_create1()

  • epoll_wait()

  • epoll_pwait()

  • epoll_ctl()

  • epoll_pwait2(): very rarely used by applications

  • sendfile(): unoptimized

  • fcntl()

    • F_GETFL: only O_NONBLOCK

    • F_SETFL: only O_NONBLOCK

    • F_GETPIPE_SZ: very rarely used by applications

    • F_SETPIPE_SZ: very rarely used by applications

  • ioctl()

    • FIONREAD

    • FIONBIO


Networking (sockets)

Gramine supports the most important networking protocols. In particular, Gramine supports only the following protocol families:

  • AF_INET (IPv4 Internet protocols, e.g. TCP/IP and UDP/IP),

  • AF_INET6 (IPv6 Internet protocols, e.g. TCP/IP and UDP/IP),

  • AF_UNIX aka AF_LOCAL (UNIX domain sockets).

Gramine supports only two types of sockets:

  • SOCK_STREAM (connection-based byte streams),

  • SOCK_DGRAM (connectionless datagrams).

Gramine supports TCP/IP sockets and UDP/IP sockets, i.e. the combinations AF_INET/AF_INET6 + SOCK_STREAM and AF_INET/AF_INET6 + SOCK_DGRAM respectively. Gramine supports stream UNIX domain sockets (AF_UNIX + SOCK_STREAM), but does not support datagram UNIX domain sockets (AF_UNIX + SOCK_DGRAM).

Non-blocking sockets (SOCK_NONBLOCK) are supported. Non-blocking connects are supported, i.e., cases when connect() returns -EINPROGRESS are supported.

Generation of the SIGPIPE signal on send operation if the receive end of a socket has been closed is supported.

By design, Gramine does not implement a full network stack; it relies on the host network stack for most operations.

Other networking limitations in Gramine include:

  • no support for auto binding in the listen() system call;

  • dummy support for ancillary data (aka control messages): received messages always indicate there is no ancillary data attached to them.

TCP/IP and UDP/IP sockets

TCP/IP and UDP/IP sockets (TCP and UDP for short) support all Berkeley sockets APIs, including socket(), bind(), listen(), connect(), accept(), send(), recv(), getsockopt(), setsockopt(), getsockname(), getpeername(), shutdown(), etc. system calls. Polling on TCP and UDP sockets via poll(), ppoll(), select(), epoll_*() system calls is supported.

TCP sockets support only MSG_NOSIGNAL, MSG_DONTWAIT and MSG_MORE flags in send(), sendto(), sendmsg(), sendmmsg() system calls. Note that MSG_MORE flag is ignored. UDP sockets support only MSG_NOSIGNAL and MSG_DONTWAIT flags.

TCP sockets support only MSG_PEEK, MSG_DONTWAIT and MSG_TRUNC flags in recv(), recvfrom(), recvmsg(), recvmmsg() system calls. UDP sockets support only MSG_DONTWAIT and MSG_TRUNC flags.

TCP and UDP sockets support the following socket options:

  • SO_ACCEPTCONN, SO_DOMAIN, SO_TYPE, SO_PROTOCOL, SO_ERROR (all read-only),

  • SO_RCVTIMEO, SO_SNDTIMEO, SO_REUSEADDR, SO_REUSEPORT, SO_BROADCAST, SO_KEEPALIVE, SO_LINGER, SO_RCVBUF, SO_SNDBUF,

  • IPV6_V6ONLY,

  • IP_RECVERR, IPV6_RECVERR (allowed but ignored).

TCP sockets additionally support the following socket options: TCP_CORK, TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT, TCP_NODELAY and TCP_USER_TIMEOUT.
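The sketch below exercises a few of the TCP features listed above over the loopback interface: the SO_REUSEADDR and TCP_NODELAY options, the MSG_NOSIGNAL send flag and the MSG_PEEK receive flag. It binds the server socket explicitly before listen(), since auto binding in listen() is listed as unsupported. The payload and variable names are illustrative, and the example assumes a loopback interface is available.

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 0))  # explicit bind; port 0 picks a free port
server.listen(1)
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
client.connect(("127.0.0.1", port))
conn, _peer = server.accept()

# MSG_NOSIGNAL suppresses SIGPIPE if the peer has already closed.
client.send(b"ping", socket.MSG_NOSIGNAL)

# MSG_PEEK returns the data without consuming it from the receive queue.
peeked = conn.recv(4, socket.MSG_PEEK)
received = conn.recv(4)
nodelay = client.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

for sock in (conn, client, server):
    sock.close()
```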

Note on domain names configuration
  • To use libc name-resolving Berkeley socket APIs like gethostbyname(), gethostbyaddr(), getaddrinfo(), one must enable the sys.enable_extra_runtime_domain_names_conf manifest option.

Related system calls
  • socket(): see notes above

  • bind()

  • listen()

  • accept()

  • accept4()

  • connect()

  • close()

  • shutdown()

  • getsockname()

  • getpeername()

  • getsockopt()

  • setsockopt()

  • fstat()

  • read()

  • readv()

  • write()

  • writev()

  • recv(): see supported flags above

  • recvfrom(): see supported flags above

  • recvmsg(): see supported flags above

  • recvmmsg(): see supported flags above

  • send(): see supported flags above

  • sendto(): see supported flags above

  • sendmsg(): see supported flags above

  • sendmmsg(): see supported flags above

  • select()

  • pselect6()

  • poll()

  • ppoll()

  • epoll_create()

  • epoll_create1()

  • epoll_wait()

  • epoll_pwait()

  • epoll_ctl()

  • epoll_pwait2(): very rarely used by applications

  • sendfile(): unoptimized

  • fcntl()

    • F_GETFL: only O_NONBLOCK

    • F_SETFL: only O_NONBLOCK

  • ioctl()

    • FIONREAD

    • FIONBIO

Related pseudo-files
  • /proc/sys/net/core/

  • /proc/sys/net/ipv4/

  • /proc/sys/net/ipv6/


UNIX domain sockets

UNIX domain sockets (UDSes) are emulated in Gramine directly as host-level pipes (to be more specific, as socketpairs for Linux hosts). In case of SGX backend, UDSes are transparently encrypted. For additional information on general properties of IPC in Gramine, see the “Overview of Inter-Process Communication (IPC)” section.

Gramine does not allow UDS communication between Gramine processes and the host. Gramine also does not allow communication between Gramine processes from two different Gramine instances. Communication on UDSes is possible only between two Gramine processes in the same Gramine instance. See also the “Pipes and FIFOs (named pipes)” section.

UDSes support all Berkeley sockets APIs, including socket(), bind(), listen(), connect(), accept(), send(), recv(), getsockopt(), setsockopt(), getsockname(), getpeername(), shutdown(), etc. system calls. Polling on UDSes via poll(), ppoll(), select(), epoll_*() system calls is supported.

Named UDSes are currently not visible on the Gramine filesystem (they do not have a corresponding dentry). This may be implemented in the near future, please see the note below.

UDSes do not support ancillary data (aka control messages) in sendmsg() and recvmsg() system calls. In particular, the SCM_RIGHTS type is not supported; support for this type may be added in the future.

Gramine does not support connect() system call on an already bound UDS (via bind()).

UDSes support only MSG_NOSIGNAL, MSG_DONTWAIT and MSG_MORE flags in send(), sendto(), sendmsg(), sendmmsg() system calls. Note that MSG_MORE flag is ignored.

UDSes support only MSG_PEEK, MSG_DONTWAIT and MSG_TRUNC flags in recv(), recvfrom(), recvmsg(), recvmmsg() system calls.

UDSes support the following socket options:

  • SO_ACCEPTCONN, SO_DOMAIN, SO_TYPE, SO_PROTOCOL, SO_ERROR (all read-only),

  • SO_REUSEADDR (ignored, same as in Linux).

Note on named UDSes
Related system calls
Related pseudo-files

I/O multiplexing

Gramine implements I/O multiplexing system calls: select(), pselect6(), poll(), ppoll(), as well as the epoll family of system calls (epoll_*()). All these system calls are emulated via the ppoll() Linux-host system call.

Gramine supports I/O multiplexing on pipes, FIFOs, sockets and eventfd. For peculiarities of regular-files support, see the “File system operations” section.

Timeouts and signal masks are honoured. Timeout is updated on return from corresponding system calls.

Edge-triggered and level-triggered events in epoll are supported (the EPOLLET flag). EPOLLONESHOT, EPOLL_NEEDS_REARM flags are supported. EPOLLWAKEUP flag is ignored because Gramine does not implement autosleep.

Select and poll families of system calls are implemented in Gramine.

Poll/ppoll system calls have the following limitation:

  • POLLRDHUP is always reported together with POLLHUP.

Epoll family of system calls has the following limitations:

  • No sharing of an epoll instance between processes; updates in one process (e.g. adding an fd to be monitored) won’t be visible in the other process.

  • EPOLLEXCLUSIVE is a no-op; this is correct semantically, but may reduce performance of apps using this flag.

  • Adding an epoll to another epoll instance is not currently supported.

  • EPOLLRDHUP is always reported together with EPOLLHUP.
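To make the edge-triggered semantics concrete, the following sketch registers the read end of a pipe with EPOLLET and shows that an event is reported once per readiness transition rather than on every call. It uses a single process and a single epoll instance, so it stays within the limitations listed above; variable names are illustrative.

```python
import os
import select

read_fd, write_fd = os.pipe()
ep = select.epoll()
ep.register(read_fd, select.EPOLLIN | select.EPOLLET)

os.write(write_fd, b"x")
first = ep.poll(0)    # the pipe became readable: one event is reported
second = ep.poll(0)   # no new transition, so nothing is reported again

os.read(read_fd, 1)   # drain the pipe
os.write(write_fd, b"y")
third = ep.poll(0)    # empty -> readable again: a new event is reported

ep.close()
os.close(read_fd)
os.close(write_fd)
```

With a level-triggered registration (without EPOLLET), the second call would report the descriptor again because unread data is still in the pipe.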

Related system calls
  • select()

  • pselect6()

  • poll()

  • ppoll()

  • epoll_create(): see notes above

  • epoll_create1(): see notes above

  • epoll_wait(): see notes above

  • epoll_pwait(): see notes above

  • epoll_ctl(): see notes above

  • epoll_pwait2(): very rarely used by applications


Asynchronous I/O

There are two asynchronous I/O APIs in Linux kernel:

  • Linux native asynchronous I/O (Linux AIO, older API with io_setup() etc.),

  • I/O uring (io_uring, newer API with io_uring_setup() etc.).

Gramine does not currently implement either of these APIs. Gramine could implement them in the future, if need arises.

Note that AIO provided in userspace by glibc (aio_read(), aio_write(), etc.) does not depend on Gramine and is supported.

Related system calls
  • io_setup()

  • io_destroy()

  • io_getevents()

  • io_submit()

  • io_cancel()

  • io_uring_setup()

  • io_uring_enter()

  • io_uring_register()


Event notifications (eventfd)

There are two modes of eventfd:

  1. Secure “emulate-in-Gramine” – the eventfd object is created inside Gramine, and all operations are resolved entirely inside Gramine. A dummy eventfd object is created on the host, purely to trigger read/write notifications (e.g., in epoll); eventfd values are verified inside Gramine and are never exposed to the host. Since the host is used purely for notifications, a malicious host can only induce Denial of Service (DoS) attacks; thus this implementation is secure and enabled by default. This implementation is automatically disabled if sys.insecure__allow_eventfd manifest option is enabled.

    The emulation is currently implemented at the level of a single process. The emulation may work for multi-process applications, e.g., if the child process inherits the eventfd object but doesn’t use it. However, all eventfds created in the parent process are marked as invalid in child processes, i.e. inter-process communication via eventfds is not allowed.

    Note that this secure version is not able to receive events from the host OS.

  2. Insecure “passthrough-to-host” – the eventfd object is created on the host, and all operations are delegated to the host. Since this implementation is insecure, it is disallowed by default. To use this implementation, it must be explicitly allowed via the sys.insecure__allow_eventfd manifest option.

Gramine supports polling on eventfd via poll(), ppoll(), select(), epoll_*() system calls, in both secure and insecure modes.

Related system calls
  • eventfd(): see notes above

  • eventfd2(): see notes above

  • close()

  • read()

  • write()

  • select()

  • pselect6()

  • poll()

  • ppoll()

  • epoll_create()

  • epoll_create1()

  • epoll_wait()

  • epoll_pwait()

  • epoll_ctl()

  • epoll_pwait2(): very rarely used by applications


Semaphores

There are two semaphore APIs available on Linux:

  • System V semaphores (older API),

  • POSIX semaphores (newer API).

POSIX semaphores are technically not a Linux kernel API. Instead, they are implemented on top of the POSIX shared memory functionality of Linux by libc (i.e., via /dev/shm pseudo-filesystem).

Gramine currently has limited support for POSIX semaphores. Gramine does not implement System V semaphores.

Please note that in case of the SGX backend, the implementation of POSIX semaphores is insecure: semaphores are placed in shared memory, which by design is allocated in untrusted non-enclave memory, and Gramine has no way to intercept accesses to shared memory regions in order to provide security guarantees.

Related system calls
  • semget()

  • semop()

  • semtimedop()

  • semctl()

Related pseudo-files
  • /dev/shm: partially implemented, insecure by itself, see the “Shared memory” section


Message queues

There are two message-queue APIs in the Linux kernel:

  • System V message queue (older API),

  • POSIX message queue (newer API).

Gramine does not currently implement either of these APIs. Gramine could implement them in the future, if the need arises.

Related system calls
  • msgget()

  • msgctl()

  • msgrcv()

  • msgsnd()

  • mq_open()

  • mq_getsetattr()

  • mq_notify()

  • mq_timedreceive()

  • mq_timedsend()

  • mq_unlink()


Shared memory

There are two shared-memory APIs in the Linux kernel:

  • System V shared memory (older API),

  • POSIX shared memory (newer API).

Gramine currently has limited support for POSIX shared memory, targeted for special use cases like communication with hardware accelerators (e.g. GPUs).

Gramine does not implement System V shared memory.

Please note that in case of the SGX backend, the implementation of shared memory is insecure: shared memory by design is allocated in untrusted non-enclave memory, and Gramine has no way to intercept accesses to shared memory regions in order to provide security guarantees. It is the responsibility of the app developer to use shared memory correctly, with the security implications in mind.

For more information, please refer to the corresponding manifest syntax. Also see this whitepaper.

Related system calls
  • shmget()

  • shmat()

  • shmctl()

  • shmdt()

Related pseudo-files
  • /dev/shm: partially implemented, insecure by itself


IOCTLs

By default, Gramine implements only a minimal set of IOCTL request codes. See the list under “Related system calls”.

It is possible to specify arbitrary IOCTLs (with arbitrary request codes and corresponding IOCTL data structures), targeted for special use cases like communication with hardware accelerators (e.g. GPUs) or implementing socket-related IOCTLs. This is achieved via sys.ioctl_structs and sys.allowed_ioctls manifest options. Read the documentation to learn how to use this feature. There is also a corresponding whitepaper on communication with hardware accelerators. Note that arbitrary IOCTLs specified in the manifest are pass-through and thus potentially insecure by themselves in e.g. SGX environments!
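
As an illustration of one of the default request codes, the sketch below queries the number of unread bytes on a socket with FIONREAD. The helper names (`pending_bytes`, `demo`) are hypothetical, and the choice of a UNIX socket pair is an assumption made to keep the example self-contained; the text above does not state which file types accept FIONREAD under Gramine.

```python
import fcntl
import socket
import struct
import termios


def pending_bytes(sock):
    """Return the number of unread bytes queued on `sock` via ioctl(FIONREAD)."""
    # The kernel writes a C int into the buffer passed as the third argument;
    # with an immutable bytes argument, fcntl.ioctl() returns the result buffer.
    raw = fcntl.ioctl(sock.fileno(), termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def demo():
    """Send five bytes over a socket pair and report what the peer can read."""
    left, right = socket.socketpair()
    try:
        left.sendall(b"hello")
        return pending_bytes(right)
    finally:
        left.close()
        right.close()


if __name__ == "__main__":
    print(demo())
```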

Related system calls
  • ioctl()

    • TIOCGPGRP: dummy

    • FIONBIO

    • FIONCLEX

    • FIOCLEX

    • FIOASYNC

    • FIONREAD

    • ▣ other IOCTLs via sys.ioctl_structs and sys.allowed_ioctls manifest options


Date and time

Gramine partially implements getting date/time: gettimeofday(), time(), clock_gettime(), clock_getres() system calls.

Gramine does not distinguish between different clocks available for clock_gettime() and clock_getres(). All clocks are emulated via the CLOCK_REALTIME clock.
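
A small sketch of how an application might observe this behaviour: it samples two clock IDs and the resolution of the realtime clock. The function name `read_clocks` is hypothetical. On native Linux, CLOCK_MONOTONIC usually counts from an arbitrary point such as boot, so the two readings differ widely; if every clock is backed by CLOCK_REALTIME, as described above, the two readings would be expected to be nearly equal. That expectation is an inference from the text, not a measured result.

```python
import time


def read_clocks():
    """Sample two clock IDs and the reported resolution of CLOCK_REALTIME."""
    return {
        "realtime": time.clock_gettime(time.CLOCK_REALTIME),
        "monotonic": time.clock_gettime(time.CLOCK_MONOTONIC),
        "realtime_res": time.clock_getres(time.CLOCK_REALTIME),
    }


if __name__ == "__main__":
    sample = read_clocks()
    gap = abs(sample["realtime"] - sample["monotonic"])
    print(f"realtime={sample['realtime']:.3f}s monotonic={sample['monotonic']:.3f}s")
    print(f"gap between the two clocks: {gap:.3f}s")
    print(f"reported resolution: {sample['realtime_res']}s")
```

Applications that rely on CLOCK_MONOTONIC never going backwards should keep this emulation in mind.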

Gramine does not support setting or adjusting date/time: settimeofday(), clock_settime(), adjtimex(), clock_adjtime().

Gramine does not currently support getting process times (like user time, system time): times().

Note on trustworthiness of date/time on SGX

In case of the SGX backend, date/time cannot be trusted because it is queried from the possibly malicious host OS. There is currently no solution to this limitation.

Related system calls
  • gettimeofday()

  • time()

  • clock_gettime(): all clocks emulated via CLOCK_REALTIME

  • clock_getres(): all clocks emulated via CLOCK_REALTIME

  • settimeofday(): very rarely used by applications

  • clock_settime(): very rarely used by applications

  • adjtimex(): very rarely used by applications

  • clock_adjtime(): very rarely used by applications

  • times(): may be implemented in the future


Sleeps, timers and alarms

Gramine implements sleep system calls: nanosleep() and clock_nanosleep(). For the latter system call, all clocks are emulated via the CLOCK_REALTIME clock. TIMER_ABSTIME is supported. Both system calls correctly update the remaining time if they were interrupted by a signal handler.

Gramine implements getting and setting the interval timer: getitimer() and setitimer(). Only ITIMER_REAL is supported.

Gramine implements alarm clocks via alarm().
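
The sketch below exercises the supported ITIMER_REAL timer: it arms a one-shot timer with setitimer(), queries it with getitimer(), and waits for SIGALRM. The helper name `wait_for_real_timer` and the chosen delays are assumptions for illustration only. It must run in the main thread, because Python only allows signal handlers to be installed there.

```python
import signal
import time


def wait_for_real_timer(delay=0.05, timeout=2.0):
    """Arm a one-shot ITIMER_REAL timer and wait until SIGALRM is delivered.

    Returns (fired, interval), where `interval` is the repeat interval
    reported by getitimer() right after arming (0.0 for a one-shot timer).
    """
    fired = []
    previous = signal.signal(signal.SIGALRM, lambda signum, frame: fired.append(signum))
    try:
        signal.setitimer(signal.ITIMER_REAL, delay)  # interval defaults to 0
        _remaining, interval = signal.getitimer(signal.ITIMER_REAL)
        deadline = time.monotonic() + timeout
        while not fired and time.monotonic() < deadline:
            time.sleep(0.01)  # resumes after the handler has run
        return bool(fired), interval
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        # `previous` is None if the old handler was not installed from Python.
        signal.signal(signal.SIGALRM, previous if previous is not None else signal.SIG_DFL)


if __name__ == "__main__":
    print(wait_for_real_timer())
```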

Gramine does not currently implement the POSIX per-process timer: timer_create(), etc. Gramine also does not currently implement timers that notify via file descriptors. Gramine could implement these timers in the future, if the need arises.

Related system calls
  • nanosleep()

  • clock_nanosleep(): all clocks emulated via CLOCK_REALTIME

  • getitimer(): only ITIMER_REAL

  • setitimer(): only ITIMER_REAL

  • alarm()

  • timer_create(): may be implemented in the future

  • timer_settime(): may be implemented in the future

  • timer_gettime(): may be implemented in the future

  • timer_getoverrun(): may be implemented in the future

  • timer_delete(): may be implemented in the future

  • timerfd_create(): may be implemented in the future

  • timerfd_settime(): may be implemented in the future

  • timerfd_gettime(): may be implemented in the future


Randomness

Gramine implements obtaining random bytes via two Linux APIs:

  • getrandom() system call,

  • /dev/random and /dev/urandom pseudo-files.

In case of the SGX backend, Gramine always uses only one source of random bytes: the RDRAND x86 instruction. This is a secure source of randomness.
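
For illustration, the two APIs can be used from an application as sketched below. The function names are hypothetical; the calls themselves (`os.getrandom()` and an unbuffered read of /dev/urandom) are standard on Linux.

```python
import os


def random_from_syscall(n):
    """Obtain n random bytes via the getrandom() system call."""
    return os.getrandom(n)


def random_from_device(n, path="/dev/urandom"):
    """Obtain n random bytes by reading a randomness pseudo-file."""
    # buffering=0 makes the read go straight to the pseudo-file.
    with open(path, "rb", buffering=0) as dev:
        return dev.read(n)


if __name__ == "__main__":
    print(random_from_syscall(16).hex())
    print(random_from_device(16).hex())
```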

Related system calls
  • getrandom()

Related pseudo-files
  • /dev/random

  • /dev/urandom


System information and resource accounting

Gramine does not support getting resource usage metrics via the getrusage() system call.

Gramine reports only a minimal set of system information via the sysinfo() system call: only totalram, totalhigh, freeram and freehigh fields are populated.

Gramine reports only a minimal set of kernel information via the uname() system call: only sysname, nodename, release, version, machine and domainname fields are populated. Out of these, only nodename is populated with the host-provided name. The remaining fields are hard-coded (e.g. release is currently hard-coded to 3.10.0).
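
A minimal sketch for inspecting what uname() reports to the application. The helper name `kernel_identity` is hypothetical. Note that Python's `os.uname()` exposes only five of the fields and omits domainname.

```python
import os


def kernel_identity():
    """Return the uname() fields that Python exposes, as a dictionary."""
    u = os.uname()
    return {
        "sysname": u.sysname,
        "nodename": u.nodename,  # per the text, the only host-provided field
        "release": u.release,    # per the text, hard-coded under Gramine
        "version": u.version,
        "machine": u.machine,
    }


if __name__ == "__main__":
    for field, value in kernel_identity().items():
        print(f"{field}: {value}")
```

Applications that gate features on the kernel release string may take a different code path under Gramine than on the host, since the reported release is hard-coded.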

Gramine has dummy support for setting hostname and domain name via sethostname() and setdomainname(). The set names are not propagated to the host OS or other Gramine processes.

Gramine has minimal and mostly dummy support for getting and setting resource limits, via getrlimit(), setrlimit(), prlimit64(). The prlimit64() syscall can be issued only on the current process. The following resources are supported:

  • RLIMIT_CPU – dummy, no limit by default

  • RLIMIT_FSIZE – dummy, no limit by default

  • RLIMIT_DATA – implemented, affects brk() system call

  • RLIMIT_STACK – dummy, equal to sys.stack.size manifest option by default

  • RLIMIT_CORE – dummy, zero by default

  • RLIMIT_RSS – dummy, no limit by default

  • RLIMIT_NPROC – dummy, no limit by default

  • RLIMIT_NOFILE – implemented, equal to sys.fds.limit manifest option by default

  • RLIMIT_MEMLOCK – dummy, no limit by default

  • RLIMIT_AS – dummy, no limit by default

  • RLIMIT_LOCKS – dummy, no limit by default

  • RLIMIT_SIGPENDING – dummy, no limit by default

  • RLIMIT_MSGQUEUE – dummy, ~800K by default

  • RLIMIT_NICE – dummy, zero by default

  • RLIMIT_RTPRIO – dummy, zero by default

  • RLIMIT_RTTIME – dummy, no limit by default
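
The defaults listed above can be inspected from within an application. The sketch below queries a few of the limits with getrlimit(); the helper names and the selection of limits are assumptions for illustration.

```python
import resource


def describe(limit_value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit_value == resource.RLIM_INFINITY else str(limit_value)


def report_limits(names=("RLIMIT_NOFILE", "RLIMIT_DATA", "RLIMIT_STACK", "RLIMIT_CORE")):
    """Return {name: (soft, hard)} for the given resource limit names."""
    report = {}
    for name in names:
        soft, hard = resource.getrlimit(getattr(resource, name))
        report[name] = (describe(soft), describe(hard))
    return report


if __name__ == "__main__":
    for name, (soft, hard) in report_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```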

Gramine supports the /proc/cpuinfo, /proc/meminfo, /proc/stat pseudo-files with system information. In addition, Gramine supports CPU- and NUMA-node-specific pseudo-files under /sys/devices/system/cpu/ and /sys/devices/system/node/. See the list under “Related pseudo-files”. For additional pseudo-files containing process-specific information, see the “Process and thread identifiers” section.
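
As a sketch of consuming one of these pseudo-files, the parser below turns /proc/meminfo-style text into a dictionary. The function name `parse_meminfo` is hypothetical. Under Gramine, fields other than those listed under “Related pseudo-files” are documented as always zero, so callers should not rely on them.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Values followed by a 'kB' unit are returned in kB, unchanged.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    for key in ("MemTotal", "MemFree", "MemAvailable", "Committed_AS", "VmallocTotal"):
        print(key, info.get(key))
```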

Related system calls
  • getrusage()

  • sysinfo(): only totalram, totalhigh, freeram and freehigh

  • uname(): only sysname, nodename, release, version, machine and domainname

  • sethostname(): dummy

  • setdomainname(): dummy

  • getrlimit(): see notes above

  • setrlimit(): see notes above

  • prlimit64(): see notes above

Related pseudo-files
  • /proc/cpuinfo: partially implemented

    • processor, vendor_id, cpu family, model, model name, stepping, physical id, core id, cpu cores, bogomips, siblings

    • flags: all known CPU flags

  • /proc/meminfo: partially implemented

    • MemTotal, MemFree, MemAvailable, Committed_AS, VmallocTotal

    • ☒ remaining fields: always zero

  • /proc/stat: dummy

    • cpu line: all fields are zeros

    • cpuX lines: all fields are zeros

    • ctxt line: always zero

    • btime line: always zero

    • processes line: always one

    • procs_running line: always one

    • procs_blocked line: always zero

    • intr line

    • softirq line

  • /sys/devices/system/cpu/: only most important files implemented

    • /sys/devices/system/cpu/cpu[x]/

      • /sys/devices/system/cpu/cpu[x]/cache/index[x]/

        • /sys/devices/system/cpu/cpu[x]/cache/index[x]/coherency_line_size

        • /sys/devices/system/cpu/cpu[x]/cache/index[x]/level

        • /sys/devices/system/cpu/cpu[x]/cache/index[x]/number_of_sets

        • /sys/devices/system/cpu/cpu[x]/cache/index[x]/physical_line_partition

        • /sys/devices/system/cpu/cpu[x]/cache/index[x]/shared_cpu_map

        • /sys/devices/system/cpu/cpu[x]/cache/index[x]/size

        • /sys/devices/system/cpu/cpu[x]/cache/index[x]/type

      • /sys/devices/system/cpu/cpu[x]/online

      • /sys/devices/system/cpu/cpu[x]/topology/

        • /sys/devices/system/cpu/cpu[x]/topology/core_id

        • /sys/devices/system/cpu/cpu[x]/topology/core_siblings

        • /sys/devices/system/cpu/cpu[x]/topology/physical_package_id

        • /sys/devices/system/cpu/cpu[x]/topology/thread_siblings

    • /sys/devices/system/cpu/kernel_max

    • /sys/devices/system/cpu/offline

    • /sys/devices/system/cpu/online

    • /sys/devices/system/cpu/possible

    • /sys/devices/system/cpu/present

  • /sys/devices/system/node/: only most important files implemented

    • /sys/devices/system/node/node[x]/

      • /sys/devices/system/node/node[x]/cpumap

      • /sys/devices/system/node/node[x]/distance

      • /sys/devices/system/node/node[x]/hugepages/

        • /sys/devices/system/node/node[x]/hugepages/hugepages-[y]/nr_hugepages: always zero

      • /sys/devices/system/node/node[x]/meminfo: partially implemented

        • MemTotal, MemFree, MemUsed

        • ☒ remaining fields: always zero


Misc

Gramine implements vDSO, with four functions: __vdso_clock_gettime(), __vdso_gettimeofday(), __vdso_time(), __vdso_getcpu(). These functions invoke the corresponding system calls, see the “Date and time” section and the “Scheduling” section.

Gramine implements operations on file descriptors (FDs):

  • duplicating FDs via dup(), dup2(), dup3(), fcntl(F_DUPFD), fcntl(F_DUPFD_CLOEXEC),

  • getting/setting FD flags via fcntl(F_GETFD) and fcntl(F_SETFD); the only flag is FD_CLOEXEC.
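
The FD operations above can be sketched as follows: duplicate a descriptor with fcntl(F_DUPFD), then read and set its FD_CLOEXEC flag. The helper name `cloexec_demo` is hypothetical. The raw `fcntl.fcntl()` call is used instead of `os.dup()` because Python's `os.dup()` marks the new descriptor as non-inheritable on its own.

```python
import fcntl
import os


def cloexec_demo():
    """Duplicate an FD and toggle FD_CLOEXEC; return the flag (before, after)."""
    read_fd, write_fd = os.pipe()
    try:
        # F_DUPFD returns the lowest free descriptor >= 0, with FD_CLOEXEC cleared.
        dup_fd = fcntl.fcntl(read_fd, fcntl.F_DUPFD, 0)
        try:
            before = fcntl.fcntl(dup_fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC
            fcntl.fcntl(dup_fd, fcntl.F_SETFD, fcntl.FD_CLOEXEC)
            after = fcntl.fcntl(dup_fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC
            return bool(before), bool(after)
        finally:
            os.close(dup_fd)
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(cloexec_demo())
```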

Gramine implements several arch-specific (x86-64) operations:

  • getting/setting the FS segment register via arch_prctl(ARCH_GET_FS) and arch_prctl(ARCH_SET_FS),

  • getting/setting the Intel AMX feature via arch_prctl(ARCH_GET_XCOMP_SUPP), arch_prctl(ARCH_GET_XCOMP_PERM) and arch_prctl(ARCH_REQ_XCOMP_PERM).

Gramine implements minimal session management via setsid() and getsid(). It is possible to make the calling process the leader of a new session, which is enough for many workloads (e.g. JVM). However, there are serious limitations:

  • in getsid(), it’s not possible to get session id of other processes (only of this process),

  • it’s impossible to send signals to a process group,

  • daemonization is still broken: the orphaned child is not adopted by init, because there is no init process in Gramine.

Gramine implements the /dev/null and /dev/zero pseudo-files.

Related system calls
  • gettimeofday(): implemented in vDSO

  • clock_gettime(): implemented in vDSO

  • time(): implemented in vDSO

  • getcpu(): implemented in vDSO

  • dup()

  • dup2()

  • dup3()

  • fcntl()

    • F_DUPFD

    • F_DUPFD_CLOEXEC

    • F_GETFD

    • F_SETFD

  • arch_prctl()

    • ARCH_GET_XCOMP_SUPP

    • ARCH_GET_XCOMP_PERM

    • ARCH_REQ_XCOMP_PERM

  • setsid()

  • getsid()

Related pseudo-files
  • /dev/

    • /dev/null

    • /dev/zero


Advanced/infeasible, unimplemented features

Gramine does not implement the following classes of features. This is by design, to keep the codebase of Gramine minimal.

  • Berkeley Packet Filters (BPF) and eBPF: bpf()

  • Capabilities: capget(), capset()

  • Execution control and debugging: ptrace(), syslog(), perf_event_open(), acct()

  • In-kernel key management (keyrings): add_key(), request_key(), keyctl()

  • Kernel modules: create_module(), init_module(), finit_module(), delete_module(), query_module(), get_kernel_syms()

  • Memory Protection Keys: pkey_alloc(), pkey_mprotect(), pkey_free()

  • Namespaces: setns(), unshare()

  • Paging and swapping: swapon(), swapoff(), readahead()

  • Process execution domain: personality()

  • Secure Computing (seccomp) state: seccomp()

  • Zero-copy transfer of data: splice(), tee(), vmsplice(), copy_file_range()

  • Transfer of data between processes: process_vm_readv(), process_vm_writev()

  • Filesystem configuration context: fsopen(), fsconfig(), fspick(), fsmount()

  • Landlock: landlock_create_ruleset(), landlock_add_rule(), landlock_restrict_self()

  • Misc: vhangup(), modify_ldt(), kexec_load(), kexec_file_load(), reboot(), iopl(), ioperm(), uselib(), _sysctl(), quotactl(), quotactl_fd(), nfsservctl(), getpmsg(), putpmsg(), afs_syscall(), tuxcall(), security(), lookup_dcookie(), restart_syscall(), vserver(), io_pgetevents(), rseq(), open_tree()

Related system calls
  • _sysctl()

  • acct()

  • add_key()

  • afs_syscall()

  • bpf()

  • capget()

  • capset()

  • close_range()

  • copy_file_range()

  • create_module()

  • delete_module()

  • finit_module()

  • fsconfig()

  • fsmount()

  • fsopen()

  • fspick()

  • get_kernel_syms()

  • getpmsg()

  • init_module()

  • io_pgetevents()

  • ioperm()

  • iopl()

  • kexec_file_load()

  • kexec_load()

  • keyctl()

  • landlock_add_rule()

  • landlock_create_ruleset()

  • landlock_restrict_self()

  • lookup_dcookie()

  • modify_ldt()

  • nfsservctl()

  • open_tree()

  • perf_event_open()

  • personality()

  • pkey_alloc()

  • pkey_free()

  • pkey_mprotect()

  • process_vm_readv()

  • process_vm_writev()

  • ptrace()

  • putpmsg()

  • query_module()

  • quotactl()

  • quotactl_fd()

  • readahead()

  • reboot()

  • request_key()

  • restart_syscall()

  • rseq()

  • seccomp()

  • security()

  • setns()

  • splice()

  • swapoff()

  • swapon()

  • syslog()

  • tee()

  • tuxcall()

  • unshare()

  • uselib()

  • vhangup()

  • vmsplice()

  • vserver()


Gramine-specific features

Attestation

Gramine exposes low-level abstractions of attestation report and attestation quote objects (SGX Report and SGX Quote respectively, in case of the SGX backend) through the /dev/attestation/ pseudo-filesystem. Reading and writing the /dev/attestation/ pseudo-files allows applications to implement local attestation and remote attestation flows. Additionally, the /dev/attestation/keys/ pseudo-dir exposes pseudo-files to set encryption keys (in particular, for encrypted files).

For detailed information, refer to the “Attestation and Secret Provisioning” documentation of Gramine.
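
The following is a hedged sketch of how an application might obtain a quote through these pseudo-files. The write-then-read order (write user data to user_report_data, then read quote) and the 64-byte report-data size are assumptions based on the SGX report format and should be checked against the “Attestation and Secret Provisioning” documentation; the helper names are hypothetical. The function returns None when the pseudo-files are absent, for example outside Gramine or when attestation is not enabled.

```python
import os

REPORT_DATA_SIZE = 64  # assumed size of the SGX report-data field, in bytes


def pad_report_data(data):
    """Zero-pad user data to the fixed report-data size."""
    if len(data) > REPORT_DATA_SIZE:
        raise ValueError("user report data must be at most 64 bytes")
    return data.ljust(REPORT_DATA_SIZE, b"\0")


def fetch_quote(user_data, root="/dev/attestation"):
    """Return quote bytes, or None if the attestation pseudo-files are absent."""
    report_data_path = os.path.join(root, "user_report_data")
    quote_path = os.path.join(root, "quote")
    if not (os.path.exists(report_data_path) and os.path.exists(quote_path)):
        return None
    # Assumed flow: first bind the user data, then read the resulting quote.
    with open(report_data_path, "wb") as f:
        f.write(pad_report_data(user_data))
    with open(quote_path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    quote = fetch_quote(b"example nonce")
    print("no attestation pseudo-files found" if quote is None else f"{len(quote)} bytes")
```

In a real flow the user data would typically be a hash of a public key or a nonce supplied by the verifier, so that the quote is bound to the session.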

Related pseudo-files
  • /dev/attestation/

    • /dev/attestation/attestation_type

    • /dev/attestation/user_report_data

    • /dev/attestation/target_info

    • /dev/attestation/my_target_info

    • /dev/attestation/report

    • /dev/attestation/quote

    • /dev/attestation/keys

      • /dev/attestation/keys/<key_name>

      • /dev/attestation/keys/_sgx_mrenclave (only for SGX)

      • /dev/attestation/keys/_sgx_mrsigner (only for SGX)


Notes on System V ABI

⚠ The description below assumes the x86-64 architecture.

Gramine implements the system-call entry point (analogous to the SYSCALL x86 instruction ABI). Instead of performing a context switch from userland (ring-3) to kernelspace (ring-0), Gramine relies on the system call being routed directly into the Gramine process. There are two paths by which the application’s system call requests end up in Gramine emulation:

  1. Fast path, through patched C standard library (e.g. Glibc or musl): Gramine ships patched Glibc and musl where raw SYSCALL instructions are replaced with function calls into Gramine’s syscall entry point.

  2. Slow path, through an exception-handling mechanism:

    • In case of the Linux backend, Gramine sets up a seccomp policy that redirects all syscall requests from the Linux kernel back into the Gramine process.

    • In case of the SGX backend, Intel SGX hardware itself forbids the SYSCALL instruction and instead generates a #UD (illegal instruction) exception, which is delivered into the Gramine process.

The fast path is recommended for all applications. However, some applications bypass Glibc/musl and issue raw SYSCALL instructions (e.g., Golang statically compiled binaries); in this case the slow path is activated.

Gramine’s syscall entry point implementation first saves the CPU context of the current application thread on the internal stack, then calls the syscall-emulation function. Once that function returns, the context-restoring function is called, which passes control back to the application thread. The context consists of GPRs, the FP control word (fpcw) and the SSE/AVX/… control word (mxcsr).

Note that Gramine may clobber all FP/SSE/AVX/… (extended) state except the control words. We rely on the fact that applications do not assume that this extended state is preserved across system calls. Indeed, the extended state (bar control words) is explicitly described as not preserved by the System V ABI, and we assume that no sane application issues syscalls in a non-System-V compliant manner. See System V ABI docs, “Register Usage” for more information.

Gramine supports Linux x86-64 signal frames.

Notes on application loading

Gramine can execute only ELF binaries (executables and libraries) and executable scripts. Other formats are not supported.