A container enables an application to be packaged and isolated with its runtime environment. By packaging applications in containers, embedded device manufacturers can more easily maintain consistent behavior and functionality while moving the contained application from the development lab into the test lab and into production.
Why use containers in embedded Linux applications?
The use of containers in embedded Linux® OS-based devices provides many benefits to developers and manufacturers:
- Higher reliability and fault tolerance – if a containerized application crashes, it does not affect the host OS or other containers.
- Simpler software updating – traditionally, firmware updates are delivered and deployed as a single monolithic package encompassing the Linux OS distribution, middleware and applications. After updating the Linux distribution, the device must be rebooted. In a containerized architecture, discrete containers can be updated without requiring changes to the kernel or to other containers. This makes it possible to implement an over-the-air firmware update without a reboot – in fact, without the user even noticing that it has happened.
- Streamlined dependency management – because each container bundles its own dependencies, multiple applications with conflicting requirements can run on the same device.
- Stronger security posture – if one application is compromised, the attacker is contained within a single container and cannot easily move laterally to other services or to the host operating system.
- Better portability and collaboration – a container image built on a development machine can run on a target device with a compatible kernel and architecture: if the containerized software works on a development machine, it will also work on the device.
- Streamlined development – the development workload can be distributed efficiently between developers, each working separately on their own containerized application or function.
Compared with a traditional monolithic system, a containerized system requires an additional layer of system management, a function known as container orchestration.
Container orchestration is the process of automating the deployment, management, scaling, and networking of containers throughout their lifecycle. Effective orchestration enables the device manufacturer to deploy software consistently across fleets of production units and across the development, testing and production environments.
When monoliths become a liability
The adoption of embedded Linux container orchestration is in part a reaction to the problems which arise in monolithic implementations of complex embedded system designs.
In a monolithic system, shared dependencies slow releases and complicate testing.
Every update – to patch a security vulnerability or to provide new features – has to apply to the entire system, increasing the requirement for network bandwidth, while necessitating a system restart and so causing device downtime.
A monolithic system also increases the difficulty of integrating and updating third-party software components such as communications protocol stacks without risking the stability of the core application.
How to maintain container isolation
Container orchestration relies on three critical interfaces to maintain isolation between containers.
Network interfaces provide the primary communication boundary. Each container requires its own network namespace with a virtual network interface. Containers communicate through defined ports and protocols such as TCP/IP, UDP, or HTTP/REST APIs, never through direct memory access. This approach allows orchestrators to monitor traffic, enforce security policies, and reroute connections during updates or failures.
Filesystem interfaces keep each container's storage separate from other containers' storage. Containers access shared data only through explicitly mounted volumes using overlay filesystems (OverlayFS) or bind mounts. This prevents one container from corrupting another's files, and makes it clear which data persist across container restarts.
Resource interfaces use cgroups (control groups) to limit access to CPU, memory and I/O. The orchestrator allocates resources to each container and enforces these limits through the cgroup API. This prevents any single container from consuming all available resources and crashing other services.
These interfaces provide an effective way to police the boundaries between containers because they can be enforced by the orchestrator, monitored in real-time, and configured without modifying application code.
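The sketch below illustrates how these three interfaces appear in a Docker Compose file (the orchestration tool discussed later in this article). The service name, image, network, volume and limits are illustrative placeholders rather than a reference configuration.

```yaml
# One hypothetical service showing the three isolation interfaces
services:
  sensor-gateway:
    image: example/sensor-gateway:1.0
    networks:
      - control-net                    # network interface: own namespace, TCP/IP only
    volumes:
      - sensor-data:/var/lib/sensor    # filesystem interface: explicitly mounted volume
    mem_limit: 128m                    # resource interface: cgroup memory ceiling
    cpus: 0.5                          # resource interface: cgroup CPU quota

networks:
  control-net:

volumes:
  sensor-data:
```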
Designing the runtime topology
The runtime topology defines how containers are arranged, and how they interact during operation. The right choice of runtime topology will help the developer to make the most efficient use of scarce memory, processor and power resources while maintaining high reliability.
Container granularity is the first decision. Should each function run in its own container, or should related functions be grouped together? Fine-grained separation improves isolation and allows independent updates, but increases orchestration overhead. Coarse-grained grouping reduces resource usage, but widens the blast radius: more system components are exposed to the failure of a single container. The right balance depends on factors including:
- The frequency of software updates
- The system's tolerance of failure
- The resources available to the system
Service dependencies must then be explicitly mapped. Developers should define which containers depend on others, and the order in which they should start. This prevents race conditions, in which a container tries to connect to a service that has not yet started. Health checks and readiness probes ensure dependencies are actually available, not just running.
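As a sketch of how such dependencies can be declared, the Docker Compose fragment below starts a hypothetical API container only once its database dependency reports healthy. The service names, images and the pg_isready readiness command are assumptions for illustration.

```yaml
# Explicit start-up ordering gated on a health check (all names are illustrative)
services:
  database:
    image: example/device-db:3.1
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "device"]   # assumes a PostgreSQL-style database image
      interval: 10s
      retries: 5
  api:
    image: example/device-api:2.0
    depends_on:
      database:
        condition: service_healthy     # wait until the database passes its health check
```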
Network topology determines how containers discover and communicate with each other. Inter-process communication (IPC) options include direct container-to-container networking, service discovery through DNS, or a service mesh.
For embedded systems, simpler approaches such as static IP assignments or environment variables often work better than complex service discovery mechanisms.
Restart policies define what happens when containers fail. Critical services need to automatically restart with backoff strategies, while diagnostic containers might run once and exit. It is important also to consider the cascading effects of restarts on dependent services.
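A minimal sketch of contrasting policies in Docker Compose is shown below; the service names and images are illustrative. Docker itself applies an increasing delay between successive restart attempts of a failed container.

```yaml
# Contrasting restart policies for a critical service and a one-shot diagnostic task
services:
  control-loop:
    image: example/control-loop:1.4
    restart: on-failure      # restart automatically if the process exits with an error
  diagnostics:
    image: example/diagnostics:1.0
    restart: "no"            # run once and exit; never restart automatically
```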
A lightweight tool for container orchestration on embedded Linux devices
When implementing embedded container orchestration, developers require a management tool which provides the right balance between functionality and resource usage.
The Docker Compose tool, which is integrated into the FoundriesFactory™ DevOps platform for embedded Linux systems, provides the capabilities required for most applications.
Docker Compose uses a simple YAML configuration file which defines all containers, networks, and volumes in one place. This makes the system easy to understand, and enables effective version control.
At the same time, the tool has a much smaller runtime overhead than comprehensive orchestration platforms, of which Kubernetes is the best known. Docker Compose runs on resource-constrained systems with as little as 512MB of RAM. The tooling is mature, well-documented, and familiar to most developers.
Developers should be aware that Docker Compose operates only on a single host, and cannot distribute containers across multiple devices. In addition, resource limits must be manually configured and are not dynamically adjusted. If the host fails, Docker Compose cannot migrate containers elsewhere. Recovery from failures requires external monitoring scripts or systemd integration.
Nevertheless, for single-device embedded systems, Docker Compose provides an excellent balance of simplicity and functionality.
Ensuring runtime predictability with cgroups
Runtime predictability requires careful resource allocation to prevent any single container from monopolizing system resources. Cgroups provide the mechanism to apply limits to resource usage.
CPU limits control processor access in two ways. CPU shares define relative priority between containers: a container with 1,024 shares gets twice the CPU time of one with 512 shares when both are competing. CPU quotas set hard limits, restricting a container to a specific percentage of total CPU time.
The allocation of resources is more difficult in embedded systems that implement AI, because of the non-deterministic character of AI inference. In practice, this means that developers must assign guaranteed CPU quotas to mission-critical functions while allowing inference workloads to use the remaining capacity.
Memory limits prevent out-of-memory failures that crash the entire system. Developers must set hard memory limits for each container based on measured usage plus headroom. The kernel will terminate containers which exceed their limit rather than allowing system-wide memory exhaustion.
As with CPU limits, developers must reserve memory for use by essential services such as networking and logging, before allocating the remainder to application workloads.
PID limits restrict the number of processes each container can create, preventing fork bombs or runaway process creation from destabilizing the system.
I/O limits control disk read/write bandwidth and operations per second, ensuring that logging or data collection does not interfere with real-time control operations.
These limits can all be configured in the configuration file of Docker Compose or another orchestration tool, as in the sketch below. After configuration, it is prudent to monitor actual usage in production units, and to adjust the limits based on real-world behavior rather than on assumptions. Experience shows that the limits have to take into account the worst cases of high resource usage, such as loading an AI model, TLS handshakes, or data buffering.
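The fragment below sketches how these cgroup limits might look in a Docker Compose file. The numbers and device path are placeholders to be replaced with values measured on the target hardware.

```yaml
# Illustrative resource limits for a single service
services:
  inference:
    image: example/inference:1.2
    cpu_shares: 512          # relative priority when CPUs are contended
    cpus: 1.5                # hard quota: at most 1.5 cores
    mem_limit: 512m          # hard ceiling; the kernel terminates the container beyond this
    pids_limit: 64           # cap on processes, guarding against fork bombs
    blkio_config:
      device_write_bps:
        - path: /dev/mmcblk0 # assumed eMMC device on the target board
          rate: '5mb'        # throttle writes so logging cannot starve control I/O
```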
Filesystem strategy for embedded containers
Filesystem design has a direct impact on system reliability and longevity in embedded Linux container deployments. Two practices are essential.
Read-only root filesystems provide the foundation for resilient containers. Mount each container's root filesystem as read-only, preventing the application from modifying its own binaries or libraries. This creates stateless containers which always restart from a known-good configuration.
For temporary data such as cache files or process IDs, mount tmpfs volumes backed by RAM. These disappear on restart, so that every container restart is a clean slate. Read-only roots can also help protect against filesystem corruption during unexpected power loss, a common failure mode in industrial and remote devices.
Minimal persistent state reduces complexity and extends Flash memory life. Developers should identify exactly what data must survive reboots – for instance, device credentials, calibration parameters, security certificates, and critical configurations.
Store these data in explicitly mounted volumes, separate from the container root. Everything else – application logs that can be rotated, cached data, intermediate processing results – should either use tmpfs or be sent to remote storage. This approach makes updates simpler because it is clear precisely which data to preserve and migrate.
Filesystem strategy can be defined in the orchestration configuration. Make read-only mounts and volume definitions explicit so the separation between ephemeral and persistent data is clear to all developers.
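A sketch of this strategy in Docker Compose follows; the image, mount points and volume name are illustrative.

```yaml
# Read-only root, RAM-backed scratch space, and a single explicit persistent volume
services:
  controller:
    image: example/controller:2.1
    read_only: true                     # root filesystem cannot be modified at runtime
    tmpfs:
      - /tmp                            # ephemeral cache, cleared on every restart
      - /run                            # PID files and sockets, cleared on every restart
    volumes:
      - device-config:/etc/controller   # the only state that survives reboots

volumes:
  device-config:
```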
Health checks, watchdogs, and self-healing
Embedded systems often operate in remote locations in which manual intervention is expensive or impossible. Automated health monitoring and recovery are essential to maintain reliable operation in these circumstances. Best practices for automated monitoring and recovery include:
Liveness and readiness probes – liveness probes detect when a container has failed and needs restarting. This might be, for example, when an application deadlocks or enters an unrecoverable state.
Readiness probes determine when a container is prepared to accept traffic, allowing dependent services to wait until prerequisites are met.
It is prudent to configure both in the container orchestration tool: liveness probes trigger restarts, while readiness probes manage service dependencies during start-up and recovery.
Watchdog integration – this provides hardware-level protection. External watchdog timers reset the entire system if software fails to send periodic heartbeats. Integrate container health checks with the system watchdog so that orchestration failures do not leave the device in a broken state indefinitely.
Auto-restart policies – these policies define recovery behavior. Containers should restart automatically on failure, but with exponential backoff to prevent rapid restart loops which waste resources. Developers should implement circuit breakers which stop restart attempts after repeated failures, and which alert monitoring systems. This prevents containers from endlessly crashing, consuming CPU and filling storage with error logs.
Designing actionable health checks
Effective health checks must verify actual functionality, not just basic availability. Checking that a port is open shows that the process is running, but not whether it can perform its intended function. Instead, probe the operational logic underlying a system state: can the container read sensor data, access the database, or execute a calculation? A meaningful health check exercises the critical path through the application.
An actionable health check does more than show that a container has suffered some sort of malfunction: it is not enough to know that a fault has occurred. Container orchestration should use the result of the health check to prepare the embedded Linux device to handle the fault condition effectively.
Developers should design containers to degrade gracefully rather than to fail completely. If Linux containers in IoT devices lose their connection to a remote service, for instance, they should continue to provide local functionality while reporting their degraded status through readiness probes. This allows dependent services to route around problems, and enables the embedded Linux device to maintain partial operation instead of producing a total system failure.
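The Docker Compose fragment below sketches such a check. The /healthz endpoint is hypothetical and is assumed to read a sensor and touch local storage before returning success; curl is assumed to be present in the image.

```yaml
# A health check that exercises the critical path rather than just testing a port
services:
  data-collector:
    image: example/data-collector:1.3
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]  # hypothetical endpoint
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 60s    # allow models and caches to load before probes count as failures
    restart: unless-stopped
```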
Telemetry and root cause analysis
Telemetry must capture sufficient information to diagnose problems in devices in the field without overwhelming their limited storage or bandwidth. It is good practice to log container restarts, resource limit violations, health check failures, and application-specific error conditions. These logs should include timestamps, container identifiers, and contextual data such as sensor readings or API response codes. These minimal metrics enable root cause analysis of failures that occur in remote deployments.
Log redaction removes personally identifiable information and sensitive data before transmission. Embedded Linux device developers should configure offline buffering to capture telemetry locally when network connectivity is unavailable, then forward accumulated data when the connection returns. It is important to size buffers appropriately for the expected duration of the offline state and for the available Flash data storage, using log rotation to prevent storage capacity from becoming exhausted.
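As one example of keeping local telemetry within bounds, Docker Compose can cap per-container log files through the logging driver; the sizes below are illustrative.

```yaml
# Local log rotation so buffered telemetry cannot exhaust Flash storage
services:
  telemetry-agent:
    image: example/telemetry-agent:1.0
    logging:
      driver: json-file
      options:
        max-size: "5m"     # rotate each log file at 5 MB
        max-file: "3"      # keep at most three rotated files per container
```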
Optimizing container orchestration for over-the-air updating
Container orchestration transforms firmware updates from risky all-or-nothing operations into controlled, reversible deployments.
Deployment channels enable progressive rollouts across populations of devices. It is best to organize devices into classes – development, staging, and production – with each class pinned to a specific update channel. New container images deploy first to development devices for testing, then are promoted to staging for broader validation, and finally to production. This staged approach catches problems before they can affect an entire fleet. Channel assignments can be based on the location of the device, customer status, or hardware version.
Atomic updates allow individual services to update independently without full system reflashing. Each container image has a version tag. The orchestrator pulls the new image, starts a container from it, verifies that health checks pass, then stops the old container. If health checks fail, the orchestrator automatically rolls back to the previous version's image. This granular update capability means that, for instance, fixing a bug in a data collection service does not require an update of the communication stack or control logic as well.
Version compatibility requires explicit co-ordination. It is good practice to define version compatibility ranges in the container orchestration configuration – for example, specifying that API container version 2.x requires database container version 3.1 or higher.
The orchestrator validates compatibility before applying updates, preventing deployment of incompatible service combinations that would cause runtime failures.
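In a Docker Compose-based workflow, one simple way to make such constraints visible is to pin exact, known-compatible image versions in the compose file that is version-controlled for each deployment channel. The registry, tags and compatibility rule below are illustrative.

```yaml
# Exact version pins recorded alongside the compatibility rule they satisfy
services:
  api:
    image: registry.example.com/device-api:2.4.1   # 2.x series, requires database >= 3.1
  database:
    image: registry.example.com/device-db:3.2.0
```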
Release engineering patterns
Version compatibility matrices in the source repository describe which versions of a container work together. This prevents the deployment of incompatible combinations, and provides a reference when diagnosing problems with devices in the field.
Effective release engineering patterns can be implemented in one of two ways:
Blue/green deployments maintain two complete environments. Deploy updates to the inactive environment, verify functionality, then switch traffic over. This enables instant rollback by switching back if problems emerge.
Canary deployments update a small subset of devices first, monitoring for errors before expanding the rollout. For distributed fleets, select canary devices across different geographic locations and usage patterns. Automatically halt the rollout if canary devices show high failure rates, protecting the broader fleet from the installation of defective updates.
Container orchestration for security and compliance
Container orchestration provides verifiable security controls which can help device manufacturers to meet regulatory requirements such as the European CSA, and to respond to audits imposed by customers.
Effective container orchestration provides for:
Image signing and provenance – these establish trust in deployed software. Developers should sign container images with cryptographic keys during the CI/CD build process, and configure Docker Compose or another orchestration tool to reject unsigned or incorrectly signed images.
Software bill-of-materials (SBOM) documents attached to each image list all components and dependencies with their versions and known vulnerabilities. These should include provenance data which record exactly how and when each image was built, from which source code commit, and by which build system. This chain of evidence proves that production containers match audited and approved source code.
Audit trail – a record of all orchestration actions for compliance reporting. The orchestration system should support the logging of every container deployment, update, restart, and configuration change with timestamps and responsible parties.
The audit trail should include a record of which image versions ran on each device and when. Security events such as failed health checks, resource limit violations, or unauthorized access attempts need to be logged. These immutable logs can help to meet cybersecurity standards requirements such as IEC 62443 for industrial automation systems, and provide evidence for use during security incidents or customer audits.
Optimizing containers for embedded systems
Container design substantially affects an embedded Linux system's security, the efficiency with which it implements firmware updates, and its runtime performance. There are three good practices to consider when optimizing container design for constrained embedded environments.
Multi-stage builds and distroless bases minimize container size and attack surface. Use multi-stage Dockerfiles which compile code in a development environment, then copy only the final binaries into a minimal runtime image. Distroless base images contain only the application and its essential runtime dependencies: they have no shell, no package manager, and no unnecessary libraries.
This reduces image size by 50-90%, cutting the amount of network bandwidth used to deliver over-the-air updates, as well as reducing download time. Smaller images also mean fewer components that could contain vulnerabilities, helping to reduce security exposure.
Lazy-loading strategies defer loading large assets until needed. For AI inference applications with multiple models, the system should load only the active model into memory rather than all possible models at startup. In addition, it should stream firmware or map data on demand, instead of bundling everything in the container.
These approaches, however, need to be balanced against reliability: critical assets should load at boot when network connectivity is available, building warm caches before the device enters operational mode. This prevents runtime failures that occur when required assets cannot be retrieved.
CPU affinity pinning assigns specific containers to specific processor cores, for predictable performance. Container orchestration should ensure that time-critical control loops or real-time data acquisition are pinned to dedicated cores, preventing interference from background tasks such as logging or telemetry.
If possible, at least one core should be reserved for system functions and orchestration operations so that application containers cannot starve essential services. Developers can configure affinity through cgroup cpuset controls in the orchestration configuration, making core assignments explicit and auditable.
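The sketch below shows core pinning via the cpuset field in Docker Compose, assuming a quad-core SoC with core 0 left free for the OS and orchestration; the assignments are illustrative.

```yaml
# Explicit, auditable core assignments
services:
  realtime-control:
    image: example/realtime-control:3.0
    cpuset: "1"            # time-critical loop on a dedicated core
  telemetry:
    image: example/telemetry:1.1
    cpuset: "2,3"          # background tasks confined to the remaining cores
```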
Cold-start optimization
Container start-up time directly affects system recovery speed and service availability after power cycles or crashes.
Pre-compilation eliminates runtime interpretation overhead. For Python applications, it is good practice to pre-compile to bytecode (.pyc files) during image build. For Java, use ahead-of-time compilation or pre-warmed JVM images. In addition, developers should minimize initialization scripts: replace complex shell scripts with compiled binaries which execute faster and fail more predictably.
Library pruning reduces load time and memory footprint. Include only the shared libraries that the application actually uses, not entire distribution packages. It is also wise to remove shell utilities such as bash, grep, and sed unless genuinely required – most containerized applications do not need them. Tools such as ldd enable the developer to identify required dependencies, then build minimal images containing only those libraries. Smaller containers load faster and consume less RAM.
I/O and network optimization
Network efficiency is crucial in embedded Linux deployments that use bandwidth-constrained cellular or satellite connectivity. There are two main strategies which system developers can adopt.
Batching and QoS reduce transmission overhead. The system should batch telemetry data into periodic uploads rather than sending individual readings, thus minimizing protocol overhead and the cost of establishing a connection. Quality-of-service (QoS) policies enable developers to prioritize control traffic and health reports over bulk data transfers such as logs or diagnostic dumps. The container orchestration system can configure separate network namespaces for critical and non-critical containers to enforce bandwidth limits.
Connection management minimizes handshake overhead. The container orchestration system can enable the resumption of TLS sessions to avoid expensive cryptographic negotiations on reconnection. It should also tune TCP keep-alive parameters to match the characteristics of the network: longer intervals for reliable connections, shorter intervals for unreliable links to detect failures early and enable faster recovery and reconnection.
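Keep-alive behavior can be tuned per container through namespaced sysctls, as sketched below in Docker Compose; the values are illustrative and should be matched to the link characteristics.

```yaml
# Per-container TCP keep-alive tuning for an uplink service
services:
  cloud-uplink:
    image: example/cloud-uplink:1.0
    sysctls:
      net.ipv4.tcp_keepalive_time: 120    # seconds of idle time before the first probe
      net.ipv4.tcp_keepalive_intvl: 30    # seconds between probes
      net.ipv4.tcp_keepalive_probes: 4    # failed probes before the connection is dropped
```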
Validation and testing of containers in embedded Linux systems
Comprehensive testing helps to ensure that containerized systems behave correctly under real-world conditions that might differ substantially from laboratory environments.
Contract testing validates that containers communicate correctly through their defined interfaces. It is good practice to write tests which verify that each container provides the API, data format, and behavior that dependent containers expect. Run these tests in the CI pipeline whenever interface definitions change.
Chaos engineering tests resilience by injecting failures: randomly kill containers, exhaust memory limits, or saturate CPU to verify that orchestration policies work as designed and that the system recovers gracefully.
Power and connectivity testing simulates field conditions. There is great value in testing unexpected power loss at every stage of the container lifecycle: during startup, mid-operation, and during updates. Developers should verify that containers restart correctly and that persistent data remain intact. Simulate degraded networks with high latency, packet loss, and intermittent connectivity to verify that offline buffering works and that the system does not hang while waiting for unavailable services.
Field diagnostics enable remote troubleshooting. Implement diagnostic containers that can be deployed on-demand to collect system state, container logs, and resource utilization without disrupting production workloads. Provide mechanisms to increase log verbosity remotely, to capture network traffic, or to extract core dumps when investigating issues found in devices deployed in the field.
The FoundriesFactory platform: the ideal basis for building containers for embedded devices
Users of the FoundriesFactory DevOps platform for embedded Linux devices can easily incorporate a container-based architecture into their development workflow. This is because containers are natively supported in the FoundriesFactory software, which includes comprehensive Docker and Docker Compose integration.
The purpose of the FoundriesFactory platform is to streamline and systematize management of embedded Linux devices, from prototyping and development, to testing and production, to deployment in the field, maintenance and updating, and through to decommissioning and end of life. The use of containers is consistent with this purpose, as it provides efficiency and reliability benefits at all phases of the lifecycle.
Readers interested in learning more about the FoundriesFactory platform and its support for container-based development should contact Foundries.io to request a personal demonstration of the product.
