Container technologies have played a significant role in software development since their inception with chroot. Thanks to Docker, containers have become the de facto standard for packaging, distributing, deploying, and running cloud applications. As the saying goes: "Build, Ship, and Run Any App, Anywhere". The question, though, is whether this Anywhere is really anywhere. In particular, can we "Build, Ship and Run Any App, on Any IoT & Edge Devices", or at least Some Apps on Some IoT & Edge Devices?
Let's consider whether Containers can be applied to IoT & Edge software development at all and, if so, what benefits they bring. Last but not least, what challenges arise in adapting Containers to the Edge use case?
Containers in the Cloud
Before answering these questions, it makes sense to step back and summarize the benefits that Containers bring to developing and running Cloud applications. In general, Containerization improves App life-cycle management through:
- Consistency. Developers can create consistent, reproducible App runtime environments that are isolated from each other and include their dependencies.
- Portability. A Container creates an executable software package abstracted away from a host system. Hence, it is not dependent upon or tied to the host system, making it portable and allowing it to run consistently and uniformly across any platform or cloud.
- Isolation. Isolating applications from a host system and from each other improves security.
- Granular resource management that enables optimization of host resource utilization.
In conjunction with container orchestration tools (e.g. Docker Swarm, vanilla k8s, Red Hat OpenShift), Containerization drastically improves the overall CI/CD process and monitoring, and increases the resilience and scalability of applications. Together with container orchestration, it serves as the foundation for Cloud Native App architecture and development, which takes software development, operations and maintenance to an entirely different level. From a business perspective, it leads to:
- faster time to market;
- increased developer productivity;
- OpEx reduction/optimization.
Containerization has its downsides and imposes a new set of challenges, but overall the benefits and improvements are indisputable. So, again, can we get the same benefits out of Containerization for IoT & Edge App development, operations and monitoring, so that:
- embedded engineers will become happier;
- IoT & Edge companies will be more competitive and mature;
- users of IoT & Edge devices will become more satisfied?
Bumpy Road From the Cloud Towards the Edge
We at Foundries.io truly believe that the answer to this question is Yes (we are a gang of crazy engineers). Of course, there are challenges, and not everything in Container tools and engines is suitable for IoT & Edge devices.
Although this blog series is dedicated to delivering and running containerized Apps on IoT & Edge devices, let us recap the development, packaging and distribution parts of the overall flow.
Our solution is based on the premise that an IoT & Edge containerized App consists of one or more containers, and that its parts do not need to be deployed across multiple devices for it to perform its function(s). Therefore, we (praise to Andy) decided to combine the Compose file format, to define multi-container applications, with the Docker Compose utility, to run them on a single device. As a result, the notion of a "Compose App" was born.
The bumpy road towards Compose Apps can be traced through these two blogs: LMP Container Orchestration Roadmap and LMP Container Orchestration Roadmap - Episode 2. This tutorial is a good starting point for a journey through the nuts and bolts of Compose Apps.
A Compose App is the packaging and distribution unit that our CI/CD deals with. In a nutshell, the development and deployment flow of a Compose App consists of the following steps:
Developing the App's containers, their definitions (Dockerfile) and the App's compose specification (docker-compose.yml). This process is not always smooth, especially when developing embedded applications that target different CPU architectures and/or hardware platforms. This blog can be helpful in this regard: it advises how to debug, troubleshoot and test your App on a target device. This is an example of a pitfall that one may run into while building multi-arch container images.
Adding the Compose App to the git repository that is part of our overall offering.
The agent, running on devices, checks for the new Target, and:
- pulls content of the new Target's Compose App (container images, configs, etc),
- runs the Compose App by means of docker-compose, the docker engine (dockerd), and containerd.
Let us come back to the topic of this blog series: delivering Compose Apps to devices in a secure and robust way. The security part is ensured by:
- Application of The Update Framework within the scope of our OTA service.
- The mTLS-based Authentication Schema for image downloading from Registry.
The delivery part can be performed by means of the docker engine (docker-compose --- [unix socket] ---> dockerd --- [pull/https] ---> Registry). Unfortunately, it turned out that this part of dockerd is not as robust as IoT & Edge devices require. In particular, a power cut while container images are being downloaded and extracted may leave the docker image store (/var/lib/docker/overlay2) in a state in which subsequent image pulls fail at the layer extraction step. The details can be found in the description of the issue we posted to the moby project.
Applying Band-aids to Docker
We had no choice but to investigate the issue and try to fix it. We spent many hours studying and debugging the dockerd source code (and trying to remember why we decided to become software engineers in the first place). We came to the conclusion that injecting an image into the docker engine's store is not atomic by design. Moreover, changes to the store content are not synced to the underlying storage. Thus, the store can end up in an inconsistent state if image layer extraction and injection is abruptly interrupted (e.g. by a power cut).
The multi-hour debugging session helped to identify the most critical spots and to come up with a hotfix. Effectively, the patch just fsyncs specific files to storage during image layer injection into the store. The fix is just a band-aid: it significantly decreases the occurrence of the issue but does not eliminate it completely. Along with this hotfix, we developed another band-aid: a tool that detects and fixes inconsistencies in the docker engine store. The tool is included in the LmP platform and is started as a one-shot systemd service before the docker service. Together, these two band-aids reduced the issue occurrences enough to bring the overall solution to a commercial-grade level.
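To illustrate the general idea behind syncing files to storage, here is a minimal Python sketch of a durable file write. It is not the actual dockerd patch (dockerd is written in Go, and the patch touches specific store files that are not listed here); the function name write_file_durably is hypothetical, and the sketch assumes a POSIX system where a directory can be opened and fsync'd.

```python
import os
import tempfile


def write_file_durably(path: str, data: bytes) -> None:
    """Write data to path so that a power cut leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file content to storage
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory entry too, otherwise the rename itself may be lost.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the two fsync calls, the data may still sit in the page cache when power is lost, which is the kind of window that can leave a store half-written.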
There is No Perfect Software
In addition to the issue described above, the docker image pull process is not well optimized in terms of network and storage utilization. In particular, it uses temporary storage to hold the image layer archives being downloaded before they are extracted and injected into the docker engine store; hence, storage usage is not optimal. If the pull of an image is interrupted before all its layers are downloaded and injected, the next attempt starts from the beginning and re-downloads all the layers, regardless of whether they were fetched on the first attempt; hence, network usage is not optimal either.
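For contrast, a pull that survives interruptions only needs to remember which layers are already complete. The following Python sketch shows one way this could work, assuming layers are identified by sha256 digests and kept as files in a local blob directory. The names blob_is_complete, pull_missing_layers and the fetch callback are hypothetical and stand in for a real registry client.

```python
import hashlib
import os
from typing import Callable, Iterable, List


def blob_is_complete(blob_dir: str, digest: str) -> bool:
    """Return True if the blob for 'sha256:<hex>' exists and matches its digest."""
    algo, _, expected = digest.partition(":")
    path = os.path.join(blob_dir, expected)
    if algo != "sha256" or not os.path.isfile(path):
        return False
    hasher = hashlib.sha256()
    with open(path, "rb") as blob:
        for chunk in iter(lambda: blob.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected


def pull_missing_layers(
    blob_dir: str,
    digests: Iterable[str],
    fetch: Callable[[str], bytes],  # hypothetical downloader: digest -> layer bytes
) -> List[str]:
    """Download only the layers that are not already stored; return what was fetched."""
    os.makedirs(blob_dir, exist_ok=True)
    downloaded = []
    for digest in digests:
        if blob_is_complete(blob_dir, digest):
            continue  # fetched by an earlier, possibly interrupted, attempt
        data = fetch(digest)
        expected = digest.partition(":")[2]
        if hashlib.sha256(data).hexdigest() != expected:
            raise ValueError(f"digest mismatch for {digest}")
        final_path = os.path.join(blob_dir, expected)
        partial_path = final_path + ".partial"
        with open(partial_path, "wb") as out:
            out.write(data)
            out.flush()
            os.fsync(out.fileno())
        # The blob becomes visible under its final name only once it is complete.
        os.replace(partial_path, final_path)
        downloaded.append(digest)
    return downloaded
```

Calling pull_missing_layers again after an interruption skips every layer that already passed verification, so only the remaining layers cost network traffic.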
In addition, we have to support the so-called "factory reset" feature, which means removing all Apps' user and runtime data and restoring Apps to their initial state. Although docker provides some tools to remove runtime data such as containers, volumes, and networks, none of these tools fully cleans the docker engine's runtime data. The only reliable way is to remove the entire docker data root directory, i.e. /var/lib/docker by default. While this cleans the docker data, restoring the Apps' images then requires re-downloading all of them, which is not acceptable for resource-constrained devices with poor network bandwidth. Contemplating these problems led to the birth of "Restorable" Compose Apps.
Restorable Compose Apps
The key idea behind "Restorable" Compose Apps is to bypass the docker engine's image pulling functionality and to store image data outside the docker engine's data root (/var/lib/docker). As a result, the image downloading process can be made atomic, or at least more robust and free of the aforementioned issues, and Apps can be restored after a factory reset without re-fetching their images.
To avoid reinventing the wheel and writing yet another container utility from scratch, we decided to use skopeo for these two operations: pulling images and injecting them into the docker store.
Skopeo's image downloading functionality is not vulnerable to power cuts. Also, it does not re-download all image layers after an interruption: the next attempt resumes from the layer whose download was interrupted. Finally, skopeo injects images into the docker store differently from the docker pull path/API, specifically via the image load API. Although the implementation of the image load API is not entirely atomic, it is fault tolerant and can recover from various inconsistencies in the docker store caused by the abrupt termination of an image loading process.
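The two skopeo operations could look roughly like the sketch below, which only builds the command lines and does not run them. It relies on skopeo's documented copy command and its docker://, oci: and docker-daemon: transports; the exact invocation, options and storage layout used in our platform are not shown in this post, so the image reference and the store directory are purely illustrative.

```python
from typing import List


def pull_to_store_cmd(image_ref: str, store_dir: str) -> List[str]:
    """Command that copies an image from a registry into a local OCI layout directory."""
    return ["skopeo", "copy", f"docker://{image_ref}", f"oci:{store_dir}"]


def load_into_docker_cmd(store_dir: str, image_ref: str) -> List[str]:
    """Command that injects the locally stored image into the docker engine store."""
    # The docker-daemon transport needs a reference that carries a tag or a digest.
    name = image_ref.rsplit("/", 1)[-1]
    if ":" not in name and "@" not in name:
        raise ValueError("image_ref must include a tag or a digest")
    return ["skopeo", "copy", f"oci:{store_dir}", f"docker-daemon:{image_ref}"]


# Illustrative values only: a made-up image reference and store location.
image = "hub.example.io/factory/app:1.0"
store = "/var/lib/app-store/app"
print(" ".join(pull_to_store_cmd(image, store)))
print(" ".join(load_into_docker_cmd(store, image)))
```

On a device with skopeo installed, each list can be passed to subprocess.run(cmd, check=True). Because the first step writes outside /var/lib/docker, the second step can be repeated after a factory reset without touching the network.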
There is one drawback to using skopeo, though: it requires additional storage for the Apps' data (layers, manifests, configs, etc.). However, if we analyze the overall process more closely, it is not really a drawback compared with docker's pull functionality in the case of regular Compose Apps. First, keep in mind that Docker's pull process requires additional temporary storage for image layers during docker pull anyway. Hence, the overall storage usage during the Apps' lifetime is the same in both cases. Secondly, the majority of the Apps' data consists of image layers, which are stored as compressed archives (gzip format). This is a small price to pay for more robust App management and the ability to restore Apps at any time without re-fetching them.
Nothing Can Stop Us: Next Steps for Improvement
Despite the aforementioned enhancements, there is, of course, room for further improvement in the overall App flow in general, and in its delivery and runtime parts in particular. Having set the goal of iteratively developing a platform for Containerized Apps that fulfills the requirements and constraints imposed by IoT & Edge devices, we will continue to improve the overall stack. The next blog in this series will be about making sure that App updates do not fill up a device's storage and that unused App data does not accumulate after each update.