Investigating a Network Conflict Between AWS and GitHub Actions Runners

Oct 30, 2024

By Sufiyan Ghori

In modern DevOps workflows, continuous integration and deployment pipelines are critical for maintaining rapid development cycles. GitHub Actions has become a popular choice for automating these pipelines. However, as infrastructure grows in complexity, unexpected issues can arise, disrupting deployment processes.

One such challenge is network conflicts. In simple terms, a network conflict occurs when hosts in different networks use the same IP address range and need to communicate with each other. The overlap causes traffic intended for one host to be misdirected or dropped. These conflicts lead to connectivity issues that disrupt systems and slow down development cycles.

In this blog post, we’ll take a deep dive into an example of this happening at Artera, where a network conflict between private GitHub Actions runners and an AWS resource led to a production impact. We’ll look at how Docker assigns network subnets, how GitHub Actions runners create ephemeral networks, and share best practices to prevent this from happening.

Background

  • Artera uses self-hosted GitHub Actions runners deployed on multiple Amazon EKS clusters. Each EKS cluster hosts its own private runners, dedicated to running jobs specific to that cluster.
  • Each EKS cluster operates within its own isolated Virtual Private Cloud (VPC) for security and segmentation.
  • Each cluster communicates with resources in other VPCs over an AWS Transit Gateway.
  • The entire infrastructure, including EKS clusters and networking configurations, is defined and managed using Terraform.

The Problem

Recently, we set up a new private GitHub Actions runner in one of Artera’s EKS clusters, us-west-2-cluster, to handle increased workload demands.

While the runners in other clusters were operating without issues, we noticed that jobs executed by the runners in the us-west-2-cluster were consistently failing. Specifically, a job that required connecting to an RDS instance in a different VPC failed with the following error:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to
server at "aurora-db-1.prod.us-west-2-2.prod" (172.18.41.117),
port 5432 failed: No route to host

Initial Investigation

Our initial high-level understanding is depicted below (simplified).

We began by verifying the usual suspects:

  • DNS Resolution: We confirmed that the RDS endpoint aurora-db-1.prod.us-west-2-2.prod resolved correctly to its IP address 172.18.41.117.
  • Network Connectivity: Checked AWS security groups, network ACLs, and route tables to ensure the proper network paths were open.
  • Service Availability: Verified that the RDS instance was up and accepting connections from other sources. (A sketch of these checks follows this list.)
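
For reference, these checks amounted to commands along the lines of the following (a hedged sketch; the VPC ID is a placeholder and the exact filters will differ per environment):

# DNS: confirm the RDS endpoint resolves to the expected address
$ dig +short aurora-db-1.prod.us-west-2-2.prod
172.18.41.117

# TCP reachability to the Postgres port from a host that should be able to reach the database
$ nc -zv 172.18.41.117 5432

# Route tables for the runner's VPC, to confirm a route towards the database VPC exists
# (vpc-xxxxxxxx is a placeholder)
$ aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxxxxx \
    --query 'RouteTables[].Routes[].[DestinationCidrBlock,TransitGatewayId]' --output table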

To further test connectivity from within the affected runner, we exec’d into the pod and tried connecting to RDS:

runner@us-west-2-cluster-runner:/actions-runner$ curl -v telnet://aurora-db-1.prod.us-west-2-2.prod:5432
* Trying 172.18.41.117:5432...
* Connected to aurora-db-1.prod.us-west-2-2.prod (172.18.41.117) port 5432 (#0)

The runner could establish a TCP connection to the RDS instance on port 5432. So the runner could reach RDS when we connected manually, but not when the GitHub Actions workflow did the same thing? This was interesting.

Understanding the Root Cause

We suspected that the problem might be related to networking within the GitHub Actions runner pods. The fact that the runner pod could connect to the database manually, while the job failed to do so, hinted at a discrepancy between the runner's network and the network in which the jobs were running.

GitHub Actions Runners and Networking

When using GitHub Actions with self-hosted runners, understanding how the runner handles networking is crucial, especially when jobs are configured to run inside containers.

When a job does not specify a container, it runs directly on the runner host. For example:

jobs:
  build-and-push:
    runs-on: private-runner
    steps:
      - name: Build and push Docker image
        run: |
          docker build -t my-image:latest .
          docker push my-image:latest

In this case, Docker commands are executed on the runner host, as shown below (simplified).

However, when a job specifies a container, GitHub Actions runs the job inside a new container attached to a temporary, user-defined bridge network. For example:

jobs:
  database-migration:
    runs-on: self-hosted
    container:
      image: my-database-image:latest # <-- Will create a new container
    steps:
      - name: Run database migrations
        run: ./migrate.sh

In this scenario, GitHub Actions creates a new, ephemeral user-defined bridge network for the duration of the job. Docker assigns the next available subnet from its default address pool to this network.
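
Conceptually, the runner does something close to the following for a containerized job (a simplified sketch; the network name is illustrative, and the real names are generated by the runner):

# Create an ephemeral user-defined bridge network for the job;
# Docker picks the next free subnet from its address pool (e.g. 172.18.0.0/16)
docker network create github_network_example

# Run the job container attached to that network
docker run --rm --network github_network_example my-database-image:latest ./migrate.sh

# Tear the network down once the job finishes
docker network rm github_network_example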

Since the default bridge network uses 172.17.0.0/16, the ephemeral network is assigned 172.18.0.0/16, as shown below (simplified).

In our case, the affected job had container specified. This led us to investigate further into how Docker set up networks in the runner pods for containers.

Deep Dive into Docker Networking

Docker enables communication between containers and external resources using networks. By default, Docker creates:

  • Default Bridge Network (bridge): Automatically created when Docker is installed, typically using the subnet 172.17.0.0/16. Containers connected to this default network can communicate with each other via IP addresses.
  • User-Defined Bridge Networks: Custom networks you can create as needed. Docker assigns subnets to these networks from its default address pool, which ranges from 172.17.0.0/16 to 172.31.0.0/16. Each new network increments the second octet of the subnet, as the short demo after this list illustrates:
    Default bridge network: 172.17.0.0/16
    First user-defined network: 172.18.0.0/16
    Subsequent networks: 172.19.0.0/16, 172.20.0.0/16, etc.
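
This allocation pattern is easy to reproduce on any Docker host (a hedged sketch with hypothetical network names; the exact subnets depend on which networks already exist):

# Each new user-defined network gets the next free subnet from the default pool
$ docker network create demo-net-1
$ docker network create demo-net-2

$ docker network inspect demo-net-1 --format='{{range .IPAM.Config}}{{.Subnet}}{{end}}'
172.18.0.0/16
$ docker network inspect demo-net-2 --format='{{range .IPAM.Config}}{{.Subnet}}{{end}}'
172.19.0.0/16

# Clean up
$ docker network rm demo-net-1 demo-net-2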

The Investigation

To diagnose the issue, we accessed the runner pod in the us-west-2-cluster and started collecting information about the Docker networks:

$ docker network ls
NETWORK ID     NAME                                               DRIVER    SCOPE
7cc8616d88b6   bridge                                             bridge    local
b0d432482f07   github_network_abcdef1234567890abcdef1234567890    bridge    local
42ad11c9f264   host                                               host      local
d7eb5b5ae420   none                                               null      local

As expected, we observed:

  • Default Bridge Network (bridge): Uses 172.17.0.0/16.
  • GitHub Actions Custom Network: A user-defined bridge network named github_network_abcdef1234567890abcdef1234567890 appeared intermittently.

We noticed that the GitHub Actions custom network was only present while the following job was running:

jobs:
  job-name:
    container:
      image: {PRIVATE-IMAGE-URI}

This container specification caused GitHub Actions to run the job inside a new Docker container. As a result, GitHub Actions created a new user-defined network, which we were also able to verify by looking at the runner log:

2024-10-24T00:48:08.5544605Z stdout F [WORKER 2024-10-24 00:48:08Z INFO ProcessInvokerWrapper] 
Arguments:
'network create --label 6772cd github_network_4ce91a4c26c542f2adf43a40b22914e6'
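
Because these networks exist only for the lifetime of a job, filtering on the name prefix is a handy way to catch them while a workflow is running (a small diagnostic sketch, assuming watch is available on the runner):

# Poll for the runner's ephemeral networks; they appear only while a containerized job is active
$ watch -n 1 "docker network ls --filter name=github_network"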

We then checked the subnets used by these networks:

  • Default Bridge Network Subnet
$ docker network inspect bridge --format='{{range .IPAM.Config}}{{.Subnet}}{{end}}'
172.17.0.0/16
  • GitHub Actions Custom Network Subnet
$ docker network inspect github_network_abcdef1234567890abcdef1234567890 --format='{{range .IPAM.Config}}{{.Subnet}}{{end}}'
172.18.0.0/16 # Same subnet as RDS

Remember that the Aurora RDS instance had an IP address within the 172.18.0.0/16 subnet, i.e., 172.18.41.117?

Since Docker's user-defined network also uses 172.18.0.0/16, the container where the job runs believes the RDS instance is on the same local network.

Basically this (simplified):

The container attempts to reach the RDS instance as if it were on its own local network, without routing through the correct interfaces. This is what led to the "No route to host" error we were seeing.
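
A quick way to see this from inside the job container is the routing table. A hedged example of what it looks like when the ephemeral network uses 172.18.0.0/16 (assuming iproute2 is available in the image; addresses are illustrative):

$ ip route
default via 172.18.0.1 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.2

# 172.18.41.117 falls inside 172.18.0.0/16, so the kernel tries to resolve it
# directly on the local bridge instead of sending it to the default gateway;
# nothing on the bridge answers, hence "No route to host".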

Solution

To resolve the IP conflict, we needed to configure Docker to use a custom address pool that does not overlap with the RDS network (or any of Artera’s existing networks).

In our GitHub Actions runner setup (managed via Terraform), we added the --default-address-pool option to the dockerd arguments. This instructs Docker to allocate subnets from the new IP range instead of 172.17.0.0/16.

- --default-address-pool
- base=172.31.0.0/16,size=24 # We now use this pool for GitHub runner jobs
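
If the Docker daemon is configured through /etc/docker/daemon.json rather than command-line arguments, the equivalent setting looks roughly like this (a sketch; choose a base and size that fit your own environment):

$ cat /etc/docker/daemon.json
{
  "default-address-pools": [
    { "base": "172.31.0.0/16", "size": 24 }
  ]
}

Note that networks created before the change keep their original subnets, so the daemon needs a restart (and long-lived user-defined networks need to be recreated) for the new pool to take effect.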

After deploying the above change, we confirmed that Docker was now using the new address pool.

Here’s how the IP addresses were assigned to bridge networks after the update.

  • Default Bridge Network Subnet
$ docker network inspect bridge --format='{{range .IPAM.Config}}{{.Subnet}}{{end}}'
172.31.0.0/24 # Default bridge network IP changed
  • GitHub Actions Custom Network Subnet
$ docker network inspect github_network_4ce91a4c26c542f2adf43a40b22914e6 --format='{{range .IPAM.Config}}{{.Subnet}}{{end}}'
172.31.1.0/24 # This was previously 172.18.0.0/16

And the job that was previously failing now succeeded!

Best Practices

Always Specify Custom Address Pools

When using Docker, always configure custom address pools that do not overlap with any of your existing subnets or other network resources. This prevents IP conflicts and routing issues.
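
A quick way to audit a host is to print every Docker network alongside its assigned subnet and compare the result against your documented CIDR blocks (a small sketch using only standard Docker CLI commands):

# Print each Docker network with the subnet(s) it was assigned
$ docker network ls -q \
    | xargs docker network inspect --format='{{.Name}}: {{range .IPAM.Config}}{{.Subnet}} {{end}}'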

Maintain Comprehensive Records of Network CIDR Blocks

  • Keep detailed documentation of all IP ranges (CIDR blocks) used across your infrastructure, including VPCs, subnets, Kubernetes clusters, and Docker networks.
  • Update this documentation whenever changes are made to the network architecture to ensure accuracy.
  • Ensure that this information is easily accessible to all team members involved in infrastructure management.

Consistent Infrastructure Management with IaC Tools

  • Employ tools like Terraform to manage infrastructure configurations consistently across all environments.
  • Keep your IaC code in version control systems to track changes and facilitate collaboration.
  • Leverage automated deployment pipelines to apply configurations reliably and reduce human error.

Conclusion

The debugging journey took us deep into the nuances of Docker networking. It was a reminder that network issues can pop up in unexpected ways, and that defaults in one system can have profound effects when interacting with another.

While it initially appeared to be a straightforward networking problem, the root cause was an overlap between Docker’s default subnet allocations and AWS VPC subnets.

This experience highlights the importance of a thorough understanding of all layers of your technology stack, and the value of detailed documentation. It reinforces that even well-established defaults may need adjustment to fit the specific needs of a complex environment.

If tackling such infrastructure challenges excites you, consider joining our team. We’re always on the lookout for passionate individuals eager to delve into the depths of cloud infrastructure! You can explore Opportunities on Artera’s Careers page.
