Let's Write A PATH Traverser

When you open up your terminal, you can type in commands like python3 or pip. Your shell uses the PATH variable to find a file named python3 which it can execute. But how?

This article is meant to be an interactive tutorial on how to write an important component of a shell: the PATH traverser.

You can read it in two ways:

  1. Glance over the code examples
  2. Follow along like a tutorial

Either is fine!

Getting Started

Skip to Part 1 if you don't want to follow along locally.

I have setup a small playground which you can run locally using podman or docker:

podman pull ghcr.io/possiblyashrub/path-traversal:latest
podman run --name path-traversal-tutorial -it ghcr.io/possiblyashrub/path-traversal:latest

# OR (if you have docker instead)

docker pull ghcr.io/possiblyashrub/path-traversal:latest
docker run --name path-traversal-tutorial -it ghcr.io/possiblyashrub/path-traversal:latest

You can find the source and CI pipeline which builds these images on this GiHub repository.

In this article, we will be writing the program in python. You are free to try to follow along in any language. C, Go, Rust or Zig will work too. If you use a different language, make sure that it has libc bindings. You will need to be able to make syscalls. We won't use any fancy language features, so you might be fine following along in python even if it's not your main language.

Part 1: What's PATH

We will start by parsing the path variable. If you've never seen your PATH before, now is a great time to learn. The PATH is a list of directories, separated by a colon ":", which instruct the path traversal algorithm where to search for a program and in what order.

For example, in the container if we run echo $PATH you should get something like: "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin".

This corresponds to the list, ['/usr/local/sbin', '/usr/local/bin', '/usr/sbin', '/usr/bin', '/sbin', '/bin']. So when I run a command like whoami, we:

  1. Check /usr/local/sbin/whoami -> No such file, continue
  2. Check /usr/local/bin/whoami -> No such file, continue
  3. Check /usr/sbin/whoami -> No such file, continue
  4. Check /usr/bin/whoami -> File exists, execute it

Don't believe me? We can use a tool called strace to see all the system calls the shell makes:

osh-0.35# strace -e access osh -c 'whoami'
access("/etc/terminfo/x/xterm", R_OK)   = 0
access("/usr/local/sbin/whoami", X_OK)  = -1 ENOENT (No such file or directory)
access("/usr/local/bin/whoami", X_OK)   = -1 ENOENT (No such file or directory)
access("/usr/sbin/whoami", X_OK)        = -1 ENOENT (No such file or directory)
access("/usr/bin/whoami", X_OK)         = 0
root
+++ exited with 0 +++

strace -e access some command runs some command but logs all the access system calls made by some command. This is the syscall a shell uses to check if a file is executable.

Part 2: The PATH Traversal Algorithm

From the description in part 1, let's try to implement a PATH traversal algorithm.

At it's core, the algorithm has the following shape:

def lookup_executable(name: str, path: str) -> str | None:
    # PATH is a list of directories separated by colons ':'. For example "/usr/bin:/usr/local/bin"
    directories = path.split(":")

    # We check each directory in order from start to end
    for directory in directories:
        candidate = f"{directory}/{name}"
        if is_exe(candidate):
            return candidate

    return None

But you should notice that we have to call out to a magic is_exe function. How can we see if a file is executable? In UNIX, any file, from a shell script to a ELF-binary (Linux version of a .exe), could be executable.

If you were keen, in part 1 you saw that there is a syscall called access (often written access(2)) which can be used to check if a file exists AND is executable.

But, let's dig into the details of what access(2) does. We can check its "manpage" with man 2 access. Run this command locally. What should we set as the mode in our PATH traversal?

access(2)                     System Calls Manual                    access(2)

...

DESCRIPTION
       access() checks whether the calling process can access the file
       pathname.  If pathname is a symbolic link, it is dereferenced.

The "(2)" in access(2) refers to the fact that access is in "section 2" and thus is a system call.

The "(2)" is confusing, and I often forget it. You can try to run man access instead of man 2 access and it should work. But beware that dropping the 2 send you to a man page in a section you did not intend. For example, on my desktop PC man stat takes me to the stat entry in section 1 with details on the stat command-line tool, not the stat syscall. For that I'd need to run man 2 stat.

Let's edit our path traversal algorithm to use access(2). First, we need to import os to get access to os.access(), the python binding to access(2).

# traverse.py

import os

You can call os.access as follows:

# Can the current user read and execute the file at $path?
os.access(path, os.X_OK)

The second argument to access(2) is mode. In our case, we want to pass os.X_OK because we only need to check that we are able to execute the file at the candidate path. You could also check for read permissions with os.R_OK. However, you don't actually need to have permission to read a file in order to execute it. (Try it!)

Now, we can implement the lookup_executable routine we sketched above, but we will replace is_exe with os.access(candidate, os.X_OK):

# traverse.py

def lookup_executable(name: str, path: str) -> str | None:
    directories = path.split(":")

    for directory in directories:
        candidate = f"{directory}/{name}"
        if os.access(candidate, os.X_OK):
            return candidate

    return None

If we run our tests...

osh-0.35# ./test.ysh 
Running tests

==== 1 path, 1 executable ====
$ python3 driver.py hello /test
/test/hello
PASS

...

==== Directories aren't executable ====
...
FAIL: Directories aren't executable: Expected /test/path2/hello but got /test/path1/hello

============
Pass: 4
Fail: 1

Almost there! There is one more gotcha.

Part 3: Directories are Executable

UNIX filesystem APIs are tricky. A lot of this has to do with historical cruft. So here's a trivia question, what does access(2) return when called on a directory (not a file) with mode = os.X_OK?

It returns true! And here's the crazy part, many directories are executable!

Why? The executable bit on a file has been repurposed for the ability to read files inside a directory (that also have the read bit set). The read bit is instead used to mark if a user can list all the files in that directory. This means it's possible to have a directory where you can read the files in it, but you cannot see what files are present.

Why would this be helpful? Take the following example: on my University's undergrad lab machines, there was a file /home/cmput415/env.sh that setup some environment variables when you source-ed it. The folder /home/cmput415/ did not let students read (list its contents). Maybe it contained all the answers to our assignments? I do not and cannot know. But, that folder gave all students permissions which allowed us to access the /home/cmput415/env.sh file anyways.

But you can't actually execute directories! So, how can we tell access(2) to only return true if the file is a real executable? This is a trick question; we can't!

Instead, we need to make another syscall, stat. Then, we need to check the stat result to see if the path is a directory.

First, import stat, which contains a needed python helper.

# traverse.py

import stat

Then we need to add an additional check after os.access guarding against the case that the file is a directory. stat(2) (see man 2 stat) returns many properties. All we care about is if the directory is a path. For that, we use stat.S_ISREG(stat_result.st_mode) which is true if the file is a "regular file", ie. not a directory.

# traverse.py

        if os.access(candidate, os.X_OK):
            stat_result = os.stat(candidate)
            if stat.S_ISREG(stat_result.st_mode):
                return candidate

Now, when we run test:

osh-0.35# ./test.ysh 
Running tests

...

============
Pass: 5
Fail: 0

Whoo hoo! You've now written a core component of a shell :)

Part 4: You cannot simply stat

One of the drawbacks of our PATH traversal routine is that it requires 2 syscalls. Since a shell tends to make lots of syscalls, any chance to reduce the number of syscalls needed is an easy way to optimize. Our solution requiring 2 syscalls per PATH entry is not detrimental, but I want to argue why this is necessary.

stat returns the permission bits and the ids of the file's owning user and group. Combined with our knowledge of the current user's id and group ids we should be able to determine if the current user can execute the file. But there are some exceptions!

Here are 2 ways I know of to prevent execution even though the permission bits might imply otherwise:

stat(2) is not powerful enough to check all these cases, but access(2) is. Meanwhile, access(2) is not powerful enough to check if the given path is a directory. So we are at an impasse; both must be used.

Conclusion

In this article we:

You could continue down the rabbit hole in a few ways. First, try testing the path traversal algorithm using noexec filesystems or access control lists. Can you convince yourself why we need to stat(2) and access(2)?

Another direction is on the subject of performance. Shells like OSH and Bash tend to cache PATH traversal using a hash-table. You could try adding a cache to your PATH traversal algorithm.

Finally, if this type of work interests you. Try contributing to oils for unix! We are working on two new shells, OSH and YSH. OSH is the most bash compatible shell (other than bash itself) and YSH is an upgrade path from bash to a more predictable language. (Curious about the language? The test.ysh script in this tutorial was written in YSH!)