OSH Bug: Reducing dtc
package test failure to a printf '%b' bug
The Oils for Unix project has a shell, OSH, that aims to be
broadly compatible with bash and POSIX sh shells. We want to use OSH to build
and test every package in the core Alpine
distribution. One package,
dtc
, failed it's
test suite while running under OSH. (dtc
is a tool to manage device trees.)
In this article, I outline how I reduced the entire dtc
failure to a single
line. I also highlight the debugging strategies which I found to be useful.
Reducing the Error
The very first step is to narrow down all errors into a minimally reproducible test from which a fix can be developed.
For APK package failures, I prefer to start with an "unwrapping the onion" method. This method works by first reproducing the error as it was encountered, and then slowly removing layers of abstraction and focusing on sections which exhibit the error.
More concretely, this started by running the full abuild
process using
regtest/aports
, our Alpine Linux with OSH compatibility tester. The setup
process is
documented in the oils repo.
Once setup, a full build and test suite run for dtc
can be kicked off with:
regtest/aports-run.sh build-packages dtc osh-as-sh
This produces a lot of logs, but the key component is the failure:
fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob: FAIL Results differ from expected
... (many lines later)
********** TEST SUMMARY
* Total testcases: 2131
* PASS: 2130
* FAIL: 1
* Bad configuration: 0
* Strange test result: 0
**********
I've never contributed to dtc
before, nor have I used the tool. Instead of
jumping directly to the issue, let's slowly dissect it.
abuild test
Now to remove a layer from the onion. The regtest/aports-run.sh build-packages
task works by:
- Enter into an alpine chroot (with osh as sh) as user
udu
- Run
abuild -fr
on thedtc
package
For
abuild
:-r
means run the build and tests on anAPKFILE
and-f
means to ignore any caching.-fr
is the same as-f -r
.
We can do that ourselves too!
$ _chroot/aports-build/enter-chroot -u udu # Enter chroot
(chroot) ~ $ cd aports/main/dtc # Go to the dtc package
Now we are in a clean alpine linux environment ready to build the dtc
package. We could reproduce what the aports
script did and rerun abuild -fr
. But, now that we are at a lower level, we can start to narrow in on the
bug. The issue seems to occur in the test suite, so let's tell abuild
to only
run the test suite (using the cached version of dtc
built by aports
.) This
can be done with abuild check
.
(chroot) aports/main/dtc $ abuild check
... (blah blah blah)
fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob: FAIL Results differ from expected
Perfect!
Goodbye abuild
abuild
doesn't magically know every test suite and build command. It relies
on a APKBUILD
file. This defines various properties about the package, but
also the process to build the package. Notably, these are just shell files! You
can source APKBUILD
to parse it. Properties will be stored as environment
variables and actions can be invoked using the defined functions prepare
,
build
, etc. (Although, aports
injects some preamble like default_prepare
which convolutes the process a bit.)
In our case, we only care about the check
action which is where the dtc
process fails. For dtc
, the check action asks make
to run the test suite.
check() {
make check
}
But where's the Makefile? We'll need to forgo abuild
now and build the
package from scratch.
Just before that, we need to download, extract and patch the sources. This will
be our last invocation of abuild
. abuild fetch verify unpack prepare deps
will do the following in order:
fetch
Get the latestdtc
tarball, save it tosrc/
verify
Validate checksumsunpack
Extract the tarball (still undersrc/
)prepare
Apply all patchesdeps
Install all package dependencies
(chroot) aports/main/dtc $ abuild fetch verify unpack deps
... lots of logs
~/aports/main/dtc $ # The dtc source is now in src/dtc-1.7.2
This results in a faithful representation of the build directory that abuild
used. Entering it, we can try our make check
:
(chroot) aports/main/dtc $ cd src/dtc-1.7.2 # Go to the dtc source
(chroot) aports/main/dtc/src/dtc-1.7.2 $ make check
... lots of make output
fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob: FAIL Results differ from expected
Tada! We still see the error, thus we've properly dropped down a level in abstraction.
Making it without make
Continuing forward requires us to dissect a novel (to us) codebase. The easiest
way to do this is to guess, validate and test. If we really mess up, we can
either abuild clean
and restart the builddir
. Or, in the worst case, reset
the chroot.
It's also useful to know a few tools for finding and searching files in a deep file tree:
find
to search for files, good if you want to know where a script isgrep
to search within files, good to know from whence a string in your logs is from
The simplest approach is to take the output from the tests as a command:
(chroot) aports/main/dtc/src/dtc-1.7.2 $ ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
sh: fdtget-runtest.sh: not found
Unfortunately, it is not that easy, but that's where we can use find
. To
search for a file named 'fdtget-runtest.sh'
starting from the current
directory, we can use find
as so:
(chroot) aports/main/dtc/src/dtc-1.7.2 $ find . -name fdtget-runtest.sh
./tests/fdtget-runtest.sh
The file we need is in tests/fdtget-runtest.sh
!
If you are unfamiliar with
find
, here's a breakdown of the command. It's a good template to start:find . -name fdtget-runtest.sh ^ ^ | \- This is a "filter" which filters by name | In this case, a match must have the name "fdtget-runtest.sh" | \- Pathname: search the current directory
Find is a mini query-language. Translating loosely to JS, this query is like:
lsRecursive(".") // Return a list of all files and dirs in current directory .filter((item) => item.name === "fdtget-runtest.sh")
To confirm that this is actually the failing test command, we can try to run
the script. (I know that I need to run this in the tests/
directory. You may
have to guess and check, that's what I had to do!)
(chroot) aports/main/dtc/src/dtc-1.7.2 $ cd tests
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
Error loading shared library libfdt.so.1: No such file or directory (needed by ../fdtget)
Error relocating ../fdtget: fdt_strerror: symbol not found
Error relocating ../fdtget: fdt_next_property_offset: symbol not found
Error relocating ../fdtget: fdt_get_name: symbol not found
Error relocating ../fdtget: fdt_next_tag: symbol not found
Error relocating ../fdtget: fdt_getprop_by_offset: symbol not found
Error relocating ../fdtget: fdt_getprop: symbol not found
Error relocating ../fdtget: fdt_path_offset: symbol not found
Error relocating ../fdtget: fdt_first_property_offset: symbol not found
FAIL Returned error code 127
Another error! The key part here is "Error loading shared library libfdt.so.1".
This means that the dynamic linker failed to find the "libfdt.so.1" file, but
needed this file because it had the machine code for functions like
fdt_strerror
.
The dynamic linker is a tool to load code in "dynamic libraries" when you run the program. Without dynamic libraries, you have to paste in the code from your dependencies directly into your executable, "static linking". One downside of dynamic libraries is that you need to ensure that your executable files ship with the shared library files and those shared library files can be discovered by the dynamic linker when your program is run. Without those, then you don't have the code in your dependencies and the dynamic linker has to stop your program.
The dynamic linker usually only checks a few known directories for a required
dynamic library, and then gives up. We can help it out by adding more
directories into it's search path. (This is very similar to the PATH
magic
variable.) In fact, it's basically identical! We can set LD_LIBRARY_PATH
with
additional search directories.
First, we need to find the dynamic libraries ourselves (on Linux, these are
.so
files):
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ find .. -name libfdt.so.1
../libfdt/libfdt.so.1
(We use ..
here because I wanted to search the entire project without having
to cd ..
).
Then we can set the LD_LIBRARY_PATH
to include the directory containing the
.so
:
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ LD_LIBRARY_PATH=../libfdt ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
EXPECTED :-:
\336\352\255\276\357
FAIL Results differ from expected
Perfect! Down another layer of abstractions.
Reading other people's shell scripts
We can still follow the "unwrapping the onion" method, so the next step is to drop down another level. Here we are at the stage that we must read someone else's shell scripts, which (often) gets hairy.
Around this point, you will start to notice that the first attempt at narrowing on the broken component won't reproduce the error. A good way to reduce is to employ hypothesis testing. The process is:
- Read the script, then
- Guess where the error is, and
- Test your guess
To start, what does fdtget-runtest.sh
do?
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ cat fdtget-runtest.sh
#! /bin/sh
SRCDIR=`dirname "$0"`
. "$SRCDIR/testutils.sh"
LOG=tmp.log.$$
EXPECT=tmp.expect.$$
rm -f $LOG $EXPECT
trap "rm -f $LOG $EXPECT" 0
expect="$1"
printf '%b\n' "$expect" > $EXPECT
shift
verbose_run_log_check "$LOG" $VALGRIND $DTGET "$@"
if cmp $EXPECT $LOG>/dev/null; then
PASS
else
if [ -z "$QUIET_TEST" ]; then
echo "EXPECTED :-:"
cat $EXPECT
fi
FAIL "Results differ from expected"
fi
Attack #1: verbose_run_log_check
verbose_run_log_check
seemed like the most likely point where we do anything
actually interesting, like run the program. I thought that it must be failing
to run the fdtget
tool correctly. verbose_run_log_check
isn't a program,
but a shell function which must be sourced via . "$SRCDIR/testutils.sh
.
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ grep -A15 verbose_run_log_check testutils.sh
verbose_run_log_check () {
verbose_run_log "$@"
ret="$?"
FAIL_IF_SIGNAL $ret
if [ $ret != 0 ]; then
FAIL "Returned error code $ret"
fi
}
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ # verbose_run_log_check in turn calls verbose_run_log, which is also in testutils.sh
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ grep -A15 verbose_run_log testutils.sh
verbose_run_log () {
LOG="$1"
shift
"$@" > "$LOG" 2>&1
ret=$?
if [ -z "$QUIET_TEST" ]; then
cat "$LOG" >&2
fi
return $ret
}
Trace this if you'd like, but what I gathered is that we are running:
$ valgrind "$DTGET" -tr label01.dts.fdtget.test.dtb /randomnode blob
The $DTGET
var is also defined in testutils.sh
.
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ grep DTGET testutils.sh
DTGET=${TEST_BINDIR}/fdtget
I don't care about running fdtget
under valgrind
because I don't think any of the errors are actually in the fdtget
binary
itself. I started by running fdtget
directly.
$ LD_LIBRARY_PATH=../libfdt ../fdtget -tr label01.dts.fdtget.test.dtb /randomnode blob
This outputs some bytes which is supposed to equal this string:
"\012\013\014\015\336\352\255\276\357"
But no errors occur and the output looks kind of regular. When I deal with
bytes, I usually pipe into xxd
which has some pretty output. (This comes
packaged with vim
on most systems.)
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ LD_LIBRARY_PATH=../libfdt ../fdtget -tr label01.dts.fdtget.test.dtb /randomnode blob | xxd
00000000: 0a0b 0c0d deea adbe ef0a ..........
I also added an echo
statement to the fdtget-runtest.sh
and
verbose_run_log_check
seemed to be capturing this output correctly. This
guess was a dud.
Attempt #2: What's up with the comparison?
I decided to add some more echo
statements in the shell script to investigate
the exact contents of the expected and actual output strings. I want to verify
that the issue isn't in the logic comparing the expected and actual outputs.
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ cat fdtget-runtest.sh
# ...
# $LOG and $EXPECT are file paths
expect="$1"
printf '%b\n' "$expect" > $EXPECT
shift
# What bytes are we expecting?
echo "Expecting"
xxd "$EXPECT"
verbose_run_log_check "$LOG" $VALGRIND $DTGET "$@"
# What bytes did the program output?
echo "Got"
xxd "$LOG"
# ...
Which, when run:
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ LD_LIBRARY_PATH=../libfdt ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
Expecting
00000000: 0a0b 0c0d 5c33 3336 5c33 3532 5c32 3535 ....\336\352\255
00000010: 5c32 3736 5c33 3537 0a \276\357.
Got
00000000: 0a0b 0c0d deea adbe ef0a ..........
EXPECTED :-:
\336\352\255\276\357
FAIL Results differ from expected
Clearly the bytes
00000000: 0a0b 0c0d 5c33 3336 5c33 3532 5c32 3535 ....\336\352\255
00000010: 5c32 3736 5c33 3537 0a \276\357.
are unlike
00000000: 0a0b 0c0d deea adbe ef0a ..........
The comparison is working correctly. Stumped (and actually oblivious to the bug, in retrospect) I switched the chroot to use busybox ash (the default and known good shell in Alpine). Then, I reran the edited script on the test case:
(chroot) aports/main/dtc/src/dtc-1.7.2/tests $ LD_LIBRARY_PATH=../libfdt ./fdtget-runtest.sh "\012\013\014\015\336\352\255\276\357" -tr label01.dts.fdtget.test.dtb /randomnode blob
Expecting
00000000: 0a0b 0c0d deea adbe ef0a ..........
Got
00000000: 0a0b 0c0d deea adbe ef0a ..........
Immediately the issue became obvious, the expected string was interpreted incorrectly! Thus, comparison was working correctly, but generating the expected string was the root cause of the issue.
Attempt #3: "$EXPECT" is the issue
The test script is failing to convert "\012\013\014\015\336\352\255\276\357"
into 0a0b 0c0d deea adbe ef0a
. The particular section is here:
expect="$1"
printf '%b\n' "$expect" > $EXPECT
shift
Sure enough, OSH and bash are different:
osh$ printf '%b\n' "\012\013\014\015\336\352\255\276\357" | xxd
00000000: 0a0b 0c0d 5c33 3336 5c33 3532 5c32 3535 ....\336\352\255
00000010: 5c32 3736 5c33 3537 0a \276\357.
bash$ printf '%b\n' "\012\013\014\015\336\352\255\276\357" | xxd
00000000: 0a0b 0c0d deea adbe ef0a ..........
The Bug and the Fix
This is now a reproduction small enough that we can actually fix the issue. I
submitted #2401 with the
divergence from bash. The fix involved adding support for 3-digit octal escapes
('\123'
) in printf '%b'
. Previously we only accepted 3 or 4-digit octal escapes
where the first digit is a zero ('\012'
or \0123
).
More concisely, the issue could be reproduced with:
osh$ printf '%b\n' '\123'
\123
bash$ printf '%b\n' '\123' # 123 in octal is 'S' in ASCII
S
Finally, a fix was implemented in this commit.
To review, in this article I outlined the process of reproducing and reducing
the dtc
package failure bug. The steps we took were:
- Reproduced it locally with
abuild
viaregtest/aports
- Ran the test suite without
abuild
by callingmake
directly - Located and ran the exact failing test script
- Added debug statements to the test script to sniff out the error
- Produced a single-line reproduction of the divergence between bash and OSH which caused this error