Testing software at Foundries.io (part 2)

Posted on Jul 20, 2022 by Milosz Wasilewski

In the previous post I described how Foundries.io approaches testing for the "manufacturing scenario". This post covers testing the "rolling update scenario": a long-running device in the field that cannot be accessed directly.

Ideally, the same set of tests would run in both the manufacturing and rolling update scenarios. This is not fully possible, however: some tests corrupt either the whole filesystem or at least one commit in the ostree (the OTA rollback test, for example). Cleaning up after such a test to restore the “proper” device state is hard. Tests that break networking are also excluded, since the device would have no means of reporting the results back. This leaves a limited set of tests to execute.
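The selection rule described above can be sketched as a simple filter. This is only an illustration: the `corrupts_state` and `breaks_network` flags are hypothetical, and the test names `ota-rollback` and `network-down` are made up for the example (only `fs-resize` and `aklite` appear in the actual test sequence).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceTest:
    name: str
    # Hypothetical flag: the test damages the filesystem or an ostree commit.
    corrupts_state: bool = False
    # Hypothetical flag: the device cannot report results after the test.
    breaks_network: bool = False


def rolling_update_subset(tests):
    """Keep only tests that leave the device usable and reachable."""
    return [t for t in tests if not (t.corrupts_state or t.breaks_network)]


# Illustrative catalogue; names other than fs-resize and aklite are invented.
CATALOGUE = [
    DeviceTest("fs-resize"),
    DeviceTest("aklite"),
    DeviceTest("ota-rollback", corrupts_state=True),
    DeviceTest("network-down", breaks_network=True),
]

if __name__ == "__main__":
    print([t.name for t in rolling_update_subset(CATALOGUE)])
```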

When the device is “always online”, the timing of test execution becomes relevant. To solve this, we run the tests after a successful OTA update. This means testing the new software shortly after it starts running. The callback mechanism from aktualizr-lite is used to trigger the testing round.
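A callback handler for this purpose might look like the sketch below. It assumes that aktualizr-lite passes the event name and outcome to the callback program through the `MESSAGE` and `RESULT` environment variables, and that a finished, successful installation is reported as `install-post` with `OK`. Check these names against the aktualizr-lite documentation for your version. The handler only decides whether to start a test round; how the round is actually started is left out.

```python
import os
import sys


def should_start_tests(env):
    """Return True when the callback environment reports a successful install.

    Assumption: aktualizr-lite exports MESSAGE and RESULT to the callback,
    with MESSAGE == "install-post" and RESULT == "OK" after a good update.
    """
    return env.get("MESSAGE") == "install-post" and env.get("RESULT") == "OK"


def main(env):
    if should_start_tests(env):
        # Placeholder: the real trigger for the test round goes here.
        print("OTA update applied, starting test round")
    else:
        print("no test round needed for this callback")
    return 0


if __name__ == "__main__":
    sys.exit(main(os.environ))
```

Filtering on the event keeps the tests from running on every callback (for example, on download events) and ties each test round to exactly one update.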

As mentioned above, the plan is to trigger the same tests as in the manufacturing scenario. Using test-definitions to run tests in LAVA makes this possible. test-definitions includes a tool called test-runner.py that can run tests on a remote target in a similar fashion to LAVA. Importantly, it takes the same .yaml test-definition files, so the tests and parameters from the LAVA test jobs described previously need no changes. One hurdle remains: starting test-runner.py directly from the callback will not work, as this would require test-definitions to be built into the OS image. As there is no OE layer for test-definitions, this would require quite a lot of effort.

Instead, tests are run from within a container. Cloning the test-definitions repository as part of the container build makes it easier to control the exact version of the tests. The fiotest container is used as the basis for executing the test-definitions. The last missing piece is triggering the proper test from within the fiotest spec. This is done with a custom shell script. The script collects data about the device, including the current and previous targets, the factory, etc. It then executes test-runner.py, which runs the actual test. In the fiotest spec file, this looks like:

  sequence:
  - tests:
      - name: fs-resize
        command:
          - /run_test_runner.sh
          - automated/linux/fs-resize/fs-resize.yaml
      - name: aklite
        command:
          - /run_test_runner.sh
          - automated/linux/aklite/aklite.yaml

The sequence above corresponds directly to the one used in a LAVA job, so the same tests are executed in both the manufacturing and rolling update scenarios.
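The core of a wrapper such as /run_test_runner.sh can be sketched as follows. The real script is a shell script whose contents are not shown here, so this Python version is only an approximation. It assumes that the repository is cloned to `/test-definitions` inside the container, that the runner lives at `automated/utils/test-runner.py`, and that it accepts `-d` for the test definition and `-o` for the output directory; confirm the options with `test-runner.py --help`. The clone location and output directory are hypothetical, and the collection of device data (targets, factory) is omitted.

```python
import subprocess
import sys

# Assumed locations inside the container; adjust to the actual image layout.
TEST_DEFINITIONS_DIR = "/test-definitions"
OUTPUT_DIR = "/tmp/test-runner-output"


def build_command(test_def, repo=TEST_DEFINITIONS_DIR, output=OUTPUT_DIR):
    """Build the test-runner.py invocation for one .yaml test definition.

    test_def is relative to the repository root, exactly as written in the
    fiotest spec (for example automated/linux/aklite/aklite.yaml).
    """
    return [
        "python3",
        f"{repo}/automated/utils/test-runner.py",
        "-d", f"{repo}/{test_def}",  # assumed option: test definition file
        "-o", output,                # assumed option: output directory
    ]


def main(argv):
    if len(argv) != 2:
        print("usage: run_test_runner <path/to/test.yaml>", file=sys.stderr)
        return 2
    # Propagate the runner's exit code so the caller can see failures.
    return subprocess.call(build_command(argv[1]))


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Because the wrapper takes the test definition path as its only argument, adding a test to the spec means adding one more entry with a different .yaml path.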

Using both fiotest and test-runner.py makes two reporting options available, and we use both. fiotest reports results back to the FoundriesFactory backend; this method is limited, as some contextual information and logs are missing. test-runner.py, on the other hand, reports results to qa-reports.foundries.io.

The final topic to discuss is the monitoring of long-running services and OTA upgrades. This will be described in part 3.
