OSH Bug: Reducing dtc package test failure to a printf '%b' bug

The Oils for Unix project has a shell, OSH, that aims to be broadly compatible with bash and POSIX sh shells. We want to use OSH to build and test every package in the core Alpine distribution. One package, dtc, failed it's test suite while running under OSH. (dtc is a tool to manage device trees.)

In this article, I outline how I reduced the entire dtc failure to a single line. I also highlight the debugging strategies which I found to be useful.

Reducing the Error

The very first step is to narrow down all errors into a minimally reproducible test from which a fix can be developed.

For APK package failures, I prefer to start with an "unwrapping the onion" method. This method works by first reproducing the error as it was encountered, and then slowly removing layers of abstraction and focusing on sections which exhibit the error.

More concretely, this started by running the full abuild process using regtest/aports, our Alpine Linux with OSH compatibility tester. The setup process is documented in the oils repo. Once setup, a full build and test suite run for dtc can be kicked off with:

regtest/aports-run.sh build-packages dtc osh-as-sh

This produces a lot of logs, but the key component is the failure:

fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob:	FAIL Results differ from expected

... (many lines later)

********** TEST SUMMARY
*     Total testcases:	2131
*                PASS:	2130
*                FAIL:	1
*   Bad configuration:	0
* Strange test result:	0
**********

I've never contributed to dtc before, nor have I used the tool. Instead of jumping directly to the issue, let's slowly dissect it.

abuild test

Now to remove a layer from the onion. The regtest/aports-run.sh build-packages task works by:

  1. Enter into an alpine chroot (with osh as sh) as user udu
  2. Run abuild -fr on the dtc package

For abuild: -r means run the build and tests on an APKFILE and -f means to ignore any caching. -fr is the same as -f -r.

We can do that ourselves too!

$ _chroot/aports-build/enter-chroot -u udu  # Enter chroot
(chroot) ~ $ cd aports/main/dtc  # Go to the dtc package

Now we are in a clean alpine linux environment ready to build the dtc package. We could reproduce what the aports script did and rerun abuild -fr. But, now that we are at a lower level, we can start to narrow in on the bug. The issue seems to occur in the test suite, so let's tell abuild to only run the test suite (using the cached version of dtc built by aports.) This can be done with abuild check.

(chroot) aports/main/dtc $ abuild check

... (blah blah blah)

fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob:	FAIL Results differ from expected

Perfect!

Goodbye abuild

abuild doesn't magically know every test suite and build command. It relies on a APKBUILD file. This defines various properties about the package, but also the process to build the package. Notably, these are just shell files! You can source APKBUILD to parse it. Properties will be stored as environment variables and actions can be invoked using the defined functions prepare, build, etc. (Although, aports injects some preamble like default_prepare which convolutes the process a bit.)

In our case, we only care about the check action which is where the dtc process fails. For dtc, the check action asks make to run the test suite.

check() {
        make check
}

But where's the Makefile? We'll need to forgo abuild now and build the package from scratch.

Just before that, we need to download, extract and patch the sources. This will be our last invocation of abuild. abuild fetch verify unpack prepare deps will do the following in order:

(chroot) aports/main/dtc $ abuild fetch verify unpack deps
... lots of logs
~/aports/main/dtc $ # The dtc source is now in src/dtc-1.7.2

This results in a faithful representation of the build directory that abuild used. Entering it, we can try our make check:

(chroot) aports/main/dtc $ cd src/dtc-1.7.2  # Go to the dtc source
(chroot) aports/main/dtc/src/dtc-1.7.2 $ make check

... lots of make output

fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob:      FAIL Results differ from expected

Tada! We still see the error, thus we've properly dropped down a level in abstraction.

Making it without make

Continuing forward requires us to dissect a novel (to us) codebase. The easiest way to do this is to guess, validate and test. If we really mess up, we can either abuild clean and restart the builddir. Or, in the worst case, reset the chroot.

It's also useful to know a few tools for finding and searching files in a deep file tree:

The simplest approach is to take the output from the tests as a command:

(chroot) aports/main/dtc/src/dtc-1.7.2 $ ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
sh: fdtget-runtest.sh: not found

Unfortunately, it is not that easy, but that's where we can use find. To search for a file named 'fdtget-runtest.sh' starting from the current directory, we can use find as so:

(chroot) aports/main/dtc/src/dtc-1.7.2 $ find . -name fdtget-runtest.sh
./tests/fdtget-runtest.sh

The file we need is in tests/fdtget-runtest.sh!

If you are unfamiliar with find, here's a breakdown of the command. It's a good template to start:

find .  -name fdtget-runtest.sh
     ^  ^
     |  \- This is a "filter" which filters by name
     |     In this case, a match must have the name "fdtget-runtest.sh"
     |
     \- Pathname: search the current directory

Find is a mini query-language. Translating loosely to JS, this query is like:

lsRecursive(".")  // Return a list of all files and dirs in current directory
  .filter((item) => item.name === "fdtget-runtest.sh")

To confirm that this is actually the failing test command, we can try to run the script. (I know that I need to run this in the tests/ directory. You may have to guess and check, that's what I had to do!)

(chroot) aports/main/dtc/src/dtc-1.7.2  $ cd tests
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
Error loading shared library libfdt.so.1: No such file or directory (needed by ../fdtget)
Error relocating ../fdtget: fdt_strerror: symbol not found
Error relocating ../fdtget: fdt_next_property_offset: symbol not found
Error relocating ../fdtget: fdt_get_name: symbol not found
Error relocating ../fdtget: fdt_next_tag: symbol not found
Error relocating ../fdtget: fdt_getprop_by_offset: symbol not found
Error relocating ../fdtget: fdt_getprop: symbol not found
Error relocating ../fdtget: fdt_path_offset: symbol not found
Error relocating ../fdtget: fdt_first_property_offset: symbol not found
FAIL Returned error code 127

Another error! The key part here is "Error loading shared library libfdt.so.1". This means that the dynamic linker failed to find the "libfdt.so.1" file, but needed this file because it had the machine code for functions like fdt_strerror.

The dynamic linker is a tool to load code in "dynamic libraries" when you run the program. Without dynamic libraries, you have to paste in the code from your dependencies directly into your executable, "static linking". One downside of dynamic libraries is that you need to ensure that your executable files ship with the shared library files and those shared library files can be discovered by the dynamic linker when your program is run. Without those, then you don't have the code in your dependencies and the dynamic linker has to stop your program.

The dynamic linker usually only checks a few known directories for a required dynamic library, and then gives up. We can help it out by adding more directories into it's search path. (This is very similar to the PATH magic variable.) In fact, it's basically identical! We can set LD_LIBRARY_PATH with additional search directories.

First, we need to find the dynamic libraries ourselves (on Linux, these are .so files):

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ find .. -name libfdt.so.1
../libfdt/libfdt.so.1

(We use .. here because I wanted to search the entire project without having to cd ..).

Then we can set the LD_LIBRARY_PATH to include the directory containing the .so:

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ LD_LIBRARY_PATH=../libfdt ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob




EXPECTED :-:



\336\352\255\276\357
FAIL Results differ from expected

Perfect! Down another layer of abstractions.

Reading other people's shell scripts

We can still follow the "unwrapping the onion" method, so the next step is to drop down another level. Here we are at the stage that we must read someone else's shell scripts, which (often) gets hairy.

Around this point, you will start to notice that the first attempt at narrowing on the broken component won't reproduce the error. A good way to reduce is to employ hypothesis testing. The process is:

  1. Read the script, then
  2. Guess where the error is, and
  3. Test your guess

To start, what does fdtget-runtest.sh do?

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ cat fdtget-runtest.sh
#! /bin/sh

SRCDIR=`dirname "$0"`
. "$SRCDIR/testutils.sh"

LOG=tmp.log.$$
EXPECT=tmp.expect.$$
rm -f $LOG $EXPECT
trap "rm -f $LOG $EXPECT" 0

expect="$1"
printf '%b\n' "$expect" > $EXPECT
shift

verbose_run_log_check "$LOG" $VALGRIND $DTGET "$@"

if cmp $EXPECT $LOG>/dev/null; then
    PASS
else
    if [ -z "$QUIET_TEST" ]; then
        echo "EXPECTED :-:"
        cat $EXPECT
    fi
    FAIL "Results differ from expected"
fi

Attack #1: verbose_run_log_check

verbose_run_log_check seemed like the most likely point where we do anything actually interesting, like run the program. I thought that it must be failing to run the fdtget tool correctly. verbose_run_log_check isn't a program, but a shell function which must be sourced via . "$SRCDIR/testutils.sh.

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ grep -A15 verbose_run_log_check testutils.sh
verbose_run_log_check () {
    verbose_run_log "$@"
    ret="$?"
    FAIL_IF_SIGNAL $ret
    if [ $ret != 0 ]; then
        FAIL "Returned error code $ret"
    fi
}

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ # verbose_run_log_check in turn calls verbose_run_log, which is also in testutils.sh
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ grep -A15 verbose_run_log testutils.sh
verbose_run_log () {
    LOG="$1"
    shift
    "$@" > "$LOG" 2>&1
    ret=$?
    if [ -z "$QUIET_TEST" ]; then
        cat "$LOG" >&2
    fi
    return $ret
}

Trace this if you'd like, but what I gathered is that we are running:

$ valgrind "$DTGET" -tr label01.dts.fdtget.test.dtb /randomnode blob

The $DTGET var is also defined in testutils.sh.

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ grep DTGET testutils.sh
DTGET=${TEST_BINDIR}/fdtget

I don't care about running fdtget under valgrind because I don't think any of the errors are actually in the fdtget binary itself. I started by running fdtget directly.

$ LD_LIBRARY_PATH=../libfdt ../fdtget -tr label01.dts.fdtget.test.dtb /randomnode blob

This outputs some bytes which is supposed to equal this string: "\012\013\014\015\336\352\255\276\357"

But no errors occur and the output looks kind of regular. When I deal with bytes, I usually pipe into xxd which has some pretty output. (This comes packaged with vim on most systems.)

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ LD_LIBRARY_PATH=../libfdt ../fdtget -tr label01.dts.fdtget.test.dtb /randomnode blob | xxd
00000000: 0a0b 0c0d deea adbe ef0a                 ..........

I also added an echo statement to the fdtget-runtest.sh and verbose_run_log_check seemed to be capturing this output correctly. This guess was a dud.

Attempt #2: What's up with the comparison?

I decided to add some more echo statements in the shell script to investigate the exact contents of the expected and actual output strings. I want to verify that the issue isn't in the logic comparing the expected and actual outputs.

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ cat fdtget-runtest.sh

# ...
# $LOG and $EXPECT are file paths

expect="$1"
printf '%b\n' "$expect" > $EXPECT
shift

# What bytes are we expecting?
echo "Expecting"
xxd "$EXPECT"

verbose_run_log_check "$LOG" $VALGRIND $DTGET "$@"

# What bytes did the program output?
echo "Got"
xxd "$LOG"

# ...

Which, when run:

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ LD_LIBRARY_PATH=../libfdt ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
Expecting
00000000: 0a0b 0c0d 5c33 3336 5c33 3532 5c32 3535  ....\336\352\255
00000010: 5c32 3736 5c33 3537 0a                   \276\357.




Got
00000000: 0a0b 0c0d deea adbe ef0a                 ..........
EXPECTED :-:



\336\352\255\276\357
FAIL Results differ from expected

Clearly the bytes

00000000: 0a0b 0c0d 5c33 3336 5c33 3532 5c32 3535  ....\336\352\255
00000010: 5c32 3736 5c33 3537 0a                   \276\357.

are unlike

00000000: 0a0b 0c0d deea adbe ef0a                 ..........

The comparison is working correctly. Stumped (and actually oblivious to the bug, in retrospect) I switched the chroot to use busybox ash (the default and known good shell in Alpine). Then, I reran the edited script on the test case:

(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ LD_LIBRARY_PATH=../libfdt ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
Expecting
00000000: 0a0b 0c0d deea adbe ef0a                 ..........
Got
00000000: 0a0b 0c0d deea adbe ef0a                 ..........

Immediately the issue became obvious, the expected string was interpreted incorrectly! Thus, comparison was working correctly, but generating the expected string was the root cause of the issue.

Attempt #3: "$EXPECT" is the issue

The test script is failing to convert "\012\013\014\015\336\352\255\276\357" into 0a0b 0c0d deea adbe ef0a. The particular section is here:

expect="$1"
printf '%b\n' "$expect" > $EXPECT
shift

Sure enough, OSH and bash are different:

osh$ printf '%b\n' "\012\013\014\015\336\352\255\276\357" | xxd
00000000: 0a0b 0c0d 5c33 3336 5c33 3532 5c32 3535  ....\336\352\255
00000010: 5c32 3736 5c33 3537 0a                   \276\357.

bash$ printf '%b\n' "\012\013\014\015\336\352\255\276\357" | xxd
00000000: 0a0b 0c0d deea adbe ef0a                 ..........

The Bug and the Fix

This is now a reproduction small enough that we can actually fix the issue. I submitted #2401 with the divergence from bash. The fix involved adding support for 3-digit octal escapes ('\123') in printf '%b'. Previously we only accepted 3 or 4-digit octal escapes where the first digit is a zero ('\012' or \0123).

More concisely, the issue could be reproduced with:

osh$ printf '%b\n' '\123'
\123
bash$ printf '%b\n' '\123'  # 123 in octal is 'S' in ASCII
S

Finally, a fix was implemented in this commit.

To review, in this article I outlined the process of reproducing and reducing the dtc package failure bug. The steps we took were:

  1. Reproduced it locally with abuild via regtest/aports
  2. Ran the test suite without abuild by calling make directly
  3. Located and ran the exact failing test script
  4. Added debug statements to the test script to sniff out the error
  5. Produced a single-line reproduction of the divergence between bash and OSH which caused this error