DevOps

ArgoCD v3.2.5: Critical Patch Release with Stability Improvements

Rui Coelho — Sun, 25 Jan 2026 13:04:28 GMT

Introduction

The ArgoCD community released v3.2.5 on January 14, 2026, replacing v3.2.4, which was marked as invalid. This patch release brings fixes that improve the stability and security of one of the most widely used GitOps platforms in the Kubernetes ecosystem.

If you’re running ArgoCD in production, especially on 2.x versions (which have reached End of Life), this article explains why you should consider this update and what has changed.

🎯 Why ArgoCD v3.2.5 Matters

The v3.2 Context

ArgoCD v3.2 represents a significant evolution in the 3.x line:

v3.0 (early 2025): Fundamental architectural improvements
v3.1 (August 2025): Native OCI registry support and CLI plugins
v3.2 (November 2025): Advanced features and security fixes
v3.2.5 (January 2026): Critical stabilization

⚠️ Support Warning

ArgoCD 2.14 reached End of Life on November 4, 2025. According to the project’s support policy, only the three most recent minor versions receive security updates:

✅ v3.2.x (current)
✅ v3.1.x
✅ v3.0.x
❌ v2.14 and earlier (unsupported)
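
To make the support window concrete, here is a minimal sketch of the "three most recent minor versions" rule. The release list is hard-coded from this article, and `isSupported` is a hypothetical helper, not part of ArgoCD or its CLI.

```javascript
// Minor release lines mentioned in this article, oldest to newest.
// Assumption: this list is maintained by hand; nothing is fetched from ArgoCD.
const MINOR_RELEASES = ['2.13', '2.14', '3.0', '3.1', '3.2'];

// Hypothetical helper: true when `version` (e.g. "v3.1.4") belongs to one of
// the newest `windowSize` minor release lines.
function isSupported(version, releases = MINOR_RELEASES, windowSize = 3) {
  const match = /^v?(\d+)\.(\d+)/.exec(version);
  if (!match) {
    throw new Error(`Unrecognized version string: ${version}`);
  }
  const minorLine = `${match[1]}.${match[2]}`;
  return releases.slice(-windowSize).includes(minorLine);
}

console.log(isSupported('v3.2.5'));  // true
console.log(isSupported('v3.0.12')); // true
console.log(isSupported('v2.14.9')); // false
```

When v3.3 ships, appending '3.3' to the list would push v3.0.x out of the window, which is why teams on the oldest supported line should plan ahead.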

🔧 Key Changes in v3.2.5

1. Notifications Engine Update

Commit: fafbd44

feat: Cherry-pick to 3.2 update notifications engine to v0.5.1

The update to notifications engine v0.5.1 brings improvements in notification delivery for:

Slack
Microsoft Teams
Email
Custom webhooks
PagerDuty and others

Practical benefit: Greater reliability in sync notifications, health status, and deployment events.

2. ApplicationSet Reconciliation Fix

Commit: d7d9674

fix(appset): do not trigger reconciliation on appsets not part of 
allowed namespaces when updating a cluster secret

Problem solved: ApplicationSets in non-allowed namespaces no longer trigger unnecessary reconciliations when updating cluster secrets.

Impact:

Reduced computational load
Lower Kubernetes API consumption
More predictable behavior in multi-tenant environments

3. Error Message Improvements

Commit: e6f5403

fix: Only show 'please update resource specification' message when spec is outdated

More precise and contextual error messages, reducing confusion for operators.

4. Dependency Updates

Important commits:

# Go update to version 1.25.5
chore(deps): bump go to 1.25.5

# expr update to v1.17.7 (security)
chore(cherry-pick-3.2): bump expr to v1.17.7

# Tests against Kubernetes 1.34.2
ci: test against k8s 1.34.2

Tested compatibility with:

✅ Kubernetes 1.32.x
✅ Kubernetes 1.33.x
✅ Kubernetes 1.34.x

🚀 How to Upgrade to v3.2.5

Option 1: Non-HA Installation (Single Instance)

# Create namespace (if needed)
kubectl create namespace argocd

# Apply v3.2.5 manifest
kubectl apply -n argocd -f \
  https://raw.githubusercontent.com/argoproj/argo-cd/v3.2.5/manifests/install.yaml

Option 2: HA Installation (High Availability)

kubectl create namespace argocd

kubectl apply -n argocd -f \
  https://raw.githubusercontent.com/argoproj/argo-cd/v3.2.5/manifests/ha/install.yaml

Option 3: Via Helm Chart

# Add repository
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

# Upgrade to the chart version that ships ArgoCD v3.2.5
helm upgrade argocd argo/argo-cd \
  --namespace argocd \
  --version 9.3.2 \
  --reuse-values

Note: Helm chart 9.3.2 includes ArgoCD v3.2.5.

🔐 Security Verification

All ArgoCD images are signed with Cosign and include SLSA Level 3 Provenance:

# Verify image signature
cosign verify \
  --certificate-identity-regexp "https://github.com/argoproj/argo-cd" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  quay.io/argoproj/argocd:v3.2.5

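# Verify provenance (the attestation is signed by the SLSA generator workflow,
# not by the argo-cd repository itself)
cosign verify-attestation \
  --type slsaprovenance \
  --certificate-identity-regexp "https://github.com/slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@refs/tags/v" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  quay.io/argoproj/argocd:v3.2.5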

📊 Compatibility and Support

Supported Architectures

amd64 (x86_64)
arm64 (Apple Silicon, AWS Graviton)
ppc64le (IBM Power)
s390x (IBM Z)

Kubernetes Platforms

Google GKE
Amazon EKS
Azure AKS
Red Hat OpenShift
Rancher
K3s / K0s
Vanilla Kubernetes

🎓 Migrating from v2.x to v3.x

If you’re still on ArgoCD v2.14 or earlier, migration to v3.2.5 is critical for security reasons.

Key Behavioral Changes in v3.x

1. Fine-Grained RBAC by Default

# v2.x: Update permission applied to sub-resources
p, dev-team, applications, update, default/*, allow

# v3.x: Explicit permissions needed for resources
# Sub-resource format: <action>/<group>/<kind>/<namespace>/<name>
p, dev-team, applications, update, default/*, allow
p, dev-team, applications, update/*/Pod/*/*, default/*, allow
p, dev-team, applications, update/*/Deployment/*/*, default/*, allow

2. Tracking by Annotations (Default)

# New default configuration in v3.x
application.resourceTrackingMethod: annotation

3. RBAC on Logs Enabled

# Explicit permissions needed
p, role:developers, logs, get, */*, allow

Detailed Upgrade Guide

Official documentation:

🆕 Featured v3.2 Capabilities

1. Kustomize Version Selection via Git

# .argocd-source.yaml
kustomize:
  version: v5.3.0

You can now specify the Kustomize version directly in your Git repository!

2. Server-Side Diff

# CLI with server-side diff
argocd app diff my-app --server-side-diff

The diff is computed by the Kubernetes API server through a server-side apply dry run, which is generally more accurate than a locally computed diff.

3. Pull Request Title Matching

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
  - pullRequest:
      github:
        owner: myorg
        repo: myrepo
      filters:
      - titleMatch: "Release/.*"  # Filter by title!

4. Health Checks for GitOps Promoter

Full support for resources:

CommitStatus
PullRequest
PromotionStrategy
ChangeTransferPolicy

📈 Performance and Observability

ApplicationSet Resource Limits

To prevent status bloat and avoid hitting etcd object size limits:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
data:
  # Default limit: 5000 resources
  applicationsetcontroller.status.max.resources.count: "5000"

Recommended Monitoring

# Important metrics to monitor
argocd_app_sync_total
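argocd_app_info  # carries health_status and sync_status labels (the standalone argocd_app_health_status metric was removed in v3.0)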
argocd_app_reconcile_count
argocd_applicationset_status_resources

🔮 ArgoCD Roadmap

v3.3 (Expected February 2026)

Target date: February 2, 2026 (GA)

Expected features:

Additional performance improvements
New notification integrations
UX enhancements in the dashboard

Community Events

ArgoCon Amsterdam 2026

📅 Date: March 23–26, 2026
📍 Location: Co-located with KubeCon EU
🎫 Register: ArgoCon 2026

✅ Upgrade Checklist

Before upgrading to v3.2.5:

Review official release notes
Backup configurations (ConfigMaps, Secrets, CRDs)
Test in staging environment
Validate RBAC policies (if migrating from v2.x)
Verify plugin compatibility
Update internal documentation
Communicate changes to team

Post-upgrade:

Verify health of all applications
Test manual synchronization
Validate notifications
Monitor logs for 24–48h
Review metrics dashboards

🛠️ Common Troubleshooting

Problem: ApplicationSets Reconciling Excessively

Symptom: High CPU load, many Kubernetes API requests

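Solution: If the reconciliations are triggered by cluster secret updates and the ApplicationSets live outside the allowed namespaces, upgrading to v3.2.5 fixes this specific bug.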

Problem: Notifications Not Arriving

Symptom: Sync events don’t trigger notifications

Solution:

# Check notifications controller version
kubectl get deployment argocd-notifications-controller \
  -n argocd -o yaml | grep image

# Should be v3.2.5

Problem: RBAC Denying Log Access

Symptom: Users cannot see pod logs

Solution: Add explicit RBAC policy:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
data:
  policy.csv: |
    p, role:developer, logs, get, */*, allow

📚 Additional Resources

Official Documentation

Complementary Tools

Argo Rollouts: Progressive delivery
Argo Workflows: Workflow orchestration
Argo Events: Event-driven automation
ApplicationSet Controller: Multi-cluster app deployment

💡 Conclusion

ArgoCD v3.2.5 represents a critical stability update that all production users should consider. With important fixes in the ApplicationSet controller, dependency updates, and better notification handling, this version solidifies ArgoCD’s position as the reference GitOps solution for Kubernetes.

Recommended action:

If you’re on v3.2.4 → upgrade immediately
If you’re on v3.0–3.1 → plan upgrade in the coming weeks
If you’re on v2.x → urgent upgrade needed (EOL)

The GitOps ecosystem continues to evolve rapidly, and staying up-to-date is not just about features, but about security and supportability.

Terraform vs OpenTofu: A Comprehensive Comparison for Infrastructure as Code

Rui Coelho — Sun, 25 Jan 2026 12:59:43 GMT

The infrastructure as code (IaC) landscape experienced a significant shift in August 2023 when HashiCorp changed Terraform’s license from the Mozilla Public License (MPL) to the Business Source License (BSL). This decision sparked controversy in the open-source community and led to the birth of OpenTofu, a fork of Terraform that maintains the open-source ethos. As organizations evaluate their IaC tooling strategies, understanding the differences, similarities, and implications of choosing between Terraform and OpenTofu has become crucial.

In this article, we’ll dive deep into both tools, compare their features, examine real-world examples, and help you make an informed decision for your infrastructure needs.

The Origin Story: Understanding the Fork

Terraform’s License Change

HashiCorp’s decision to move Terraform from MPL 2.0 to BSL 1.1 was justified by the company as necessary to prevent cloud providers from offering competing managed services without contributing back to the project. While understandable from a business perspective, this change meant that Terraform was no longer truly open source by the Open Source Initiative’s definition.

The BSL allows free use for most purposes but restricts competitive commercial use. After four years, the code converts to MPL 2.0, but this waiting period was enough to concern many organizations that had built their infrastructure automation on the promise of open-source software.

The Birth of OpenTofu

In response, the Linux Foundation announced OpenTofu in September 2023 as a truly open-source alternative. Led by a coalition of companies including Gruntwork, Spacelift, env0, Scalr, and others, OpenTofu aims to maintain backward compatibility with Terraform while providing a community-driven, vendor-neutral alternative.

The project quickly gained momentum, achieving its first general availability release (1.6.0) in January 2024, maintaining parity with Terraform 1.6 while adding new features and improvements.

Core Architecture: More Similar Than Different

At their core, both Terraform and OpenTofu share the same fundamental architecture because OpenTofu is a fork of Terraform 1.5. Understanding this shared foundation is important before we explore their differences.

The HCL Configuration Language

Both tools use HashiCorp Configuration Language (HCL) for defining infrastructure. Here’s a simple example that works identically in both:

# Define an AWS EC2 instance
resource "aws_instance" "web_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name        = "WebServer"
    Environment = "Production"
    ManagedBy   = "IaC"
  }

  root_block_device {
    volume_size = 20
    volume_type = "gp3"
  }
}

# Create a security group
resource "aws_security_group" "web_sg" {
  name        = "web-server-sg"
  description = "Security group for web server"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

This configuration syntax remains identical between the two tools, which means existing Terraform configurations can generally be used with OpenTofu without modification.

State Management

Both tools use a state file to track resources. The state management approach is conceptually identical:

# Backend configuration for remote state
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

This same backend configuration works in OpenTofu, though OpenTofu has added support for additional backend types and enhanced encryption options.

Key Differences: Where They Diverge

While maintaining compatibility, OpenTofu has introduced several differentiating features and improvements.

1. Licensing and Governance

Terraform (BSL 1.1):

Restricts competitive commercial offerings
Four-year delay before converting to MPL 2.0
Controlled by HashiCorp’s business interests

OpenTofu (MPL 2.0):

Truly open source
Community-governed through the Linux Foundation
No restrictions on commercial use
Transparent decision-making process

2. State File Encryption

One of OpenTofu’s most significant innovations is native state file encryption. While Terraform relies on backend-level encryption (like S3 encryption), OpenTofu provides client-side encryption:

# OpenTofu state encryption configuration
terraform {
  encryption {
    key_provider "pbkdf2" "mykey" {
      passphrase = var.state_passphrase
    }

    method "aes_gcm" "state_encryption" {
      keys = key_provider.pbkdf2.mykey
    }

    state {
      method = method.aes_gcm.state_encryption
    }
  }
}

This ensures that sensitive data in the state file is encrypted before it leaves your machine, providing an additional layer of security that Terraform doesn’t offer natively.
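
To illustrate what "encrypted before it leaves your machine" means in practice, here is a conceptual sketch of the same two building blocks (PBKDF2 key derivation and AES-GCM) using Node's built-in crypto module. It is not OpenTofu's implementation or on-disk format; the envelope layout, the SHA-512 hash choice, and the function names are assumptions made for illustration.

```javascript
const crypto = require('node:crypto');

// Derive a 32-byte AES key from a passphrase (hash choice is an assumption).
function deriveKey(passphrase, salt, iterations) {
  return crypto.pbkdf2Sync(passphrase, salt, iterations, 32, 'sha512');
}

// Encrypt the state locally; only the returned envelope would be uploaded.
function encryptState(plaintext, passphrase, iterations = 600000) {
  const salt = crypto.randomBytes(16);
  const iv = crypto.randomBytes(12); // 96-bit nonce, the usual size for GCM
  const key = deriveKey(passphrase, salt, iterations);
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return { salt, iv, iterations, ciphertext, tag: cipher.getAuthTag() };
}

// Decrypt an envelope; throws if the passphrase is wrong or data was tampered with.
function decryptState(envelope, passphrase) {
  const key = deriveKey(passphrase, envelope.salt, envelope.iterations);
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, envelope.iv);
  decipher.setAuthTag(envelope.tag);
  return Buffer.concat([decipher.update(envelope.ciphertext), decipher.final()]).toString('utf8');
}

const state = JSON.stringify({ resources: [{ type: 'aws_db_instance', password: 'example-only' }] });
const sealed = encryptState(state, 'a long example passphrase');
console.log(decryptState(sealed, 'a long example passphrase') === state); // true
```

The point of the sketch is the ordering: the backend (S3, for example) only ever receives the ciphertext, so a leaked bucket does not expose secrets stored in the state.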

3. Enhanced Testing Framework

Both Terraform and OpenTofu support infrastructure testing through the test command introduced in Terraform 1.6. Here's an example of a test file that works in both tools:

# Test file: tests/vpc_test.tftest.hcl
variables {
  environment = "test"
  vpc_cidr    = "10.0.0.0/16"
}

run "validate_vpc_creation" {
  command = apply

  assert {
    condition     = aws_vpc.main.cidr_block == var.vpc_cidr
    error_message = "VPC CIDR does not match expected value"
  }

  assert {
    condition     = length(aws_subnet.private) == 3
    error_message = "Expected 3 private subnets"
  }
}

run "validate_internet_gateway" {
  command = plan

  assert {
    condition     = aws_internet_gateway.main.vpc_id == aws_vpc.main.id
    error_message = "Internet gateway not attached to VPC"
  }
}

Both tools provide similar testing capabilities, making infrastructure testing more accessible to teams.

4. Provider Development and Registry

Terraform:

Uses the official Terraform Registry (registry.terraform.io)
Provider development controlled by HashiCorp
Provider SDKs and HashiCorp's providers remain under MPL 2.0; the BSL covers the Terraform core

OpenTofu:

Installs the same provider releases that Terraform does
Resolves them through its own registry (registry.opentofu.org) by default
Working on provider ecosystem independence
Maintains compatibility with existing providers
Planning to mirror and potentially fork critical providers if needed

Here’s how you declare providers (syntax is identical in both tools):

# Provider declaration works identically in both Terraform and OpenTofu
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }

    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
  }
}

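Provider declarations are the same in both tools, but resolution differs: OpenTofu resolves a short address such as hashicorp/aws against registry.opentofu.org by default, while Terraform uses registry.terraform.io. Both serve the same upstream provider releases, which keeps providers available to OpenTofu users even if HashiCorp’s registry policies change.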

Real-World Comparison: A Practical Example

Let’s examine a complete, real-world scenario: deploying a multi-tier application infrastructure.

Complete Infrastructure Module

# variables.tf
variable "project_name" {
  description = "Name of the project"
  type        = string
}

variable "environment" {
  description = "Environment name"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod"
  }
}

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
  default     = "10.0.0.0/16"
}

# main.tf
terraform {
  required_version = ">= 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# VPC Configuration
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.project_name}-${var.environment}-vpc"
    Environment = var.environment
    ManagedBy   = "IaC"
  }
}

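# Availability zones referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

# Public Subnets
resource "aws_subnet" "public" {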
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  map_public_ip_on_launch = true

  tags = {
    Name        = "${var.project_name}-${var.environment}-public-${count.index + 1}"
    Type        = "Public"
    Environment = var.environment
  }
}

# Private Subnets
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index + 3)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name        = "${var.project_name}-${var.environment}-private-${count.index + 1}"
    Type        = "Private"
    Environment = var.environment
  }
}

# Application Load Balancer
resource "aws_lb" "app" {
  name               = "${var.project_name}-${var.environment}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = aws_subnet.public[*].id

  enable_deletion_protection = var.environment == "prod"

  tags = {
    Name        = "${var.project_name}-${var.environment}-alb"
    Environment = var.environment
  }
}

# Auto Scaling Group
resource "aws_autoscaling_group" "app" {
  name                = "${var.project_name}-${var.environment}-asg"
  vpc_zone_identifier = aws_subnet.private[*].id
  target_group_arns   = [aws_lb_target_group.app.arn]
  health_check_type   = "ELB"

  min_size         = var.environment == "prod" ? 3 : 1
  max_size         = var.environment == "prod" ? 10 : 3
  desired_capacity = var.environment == "prod" ? 3 : 1

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "${var.project_name}-${var.environment}-app"
    propagate_at_launch = true
  }
}

# RDS Database
resource "aws_db_instance" "main" {
  identifier     = "${var.project_name}-${var.environment}-db"
  engine         = "postgres"
  engine_version = "15.3"
  instance_class = var.environment == "prod" ? "db.t3.medium" : "db.t3.micro"

  allocated_storage     = var.environment == "prod" ? 100 : 20
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name  = "${var.project_name}_${var.environment}"
  username = "dbadmin"
  password = random_password.db_password.result

  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = var.environment == "prod" ? 7 : 1
  skip_final_snapshot     = var.environment != "prod"

  tags = {
    Name        = "${var.project_name}-${var.environment}-db"
    Environment = var.environment
  }
}

# outputs.tf
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "alb_dns_name" {
  description = "DNS name of the load balancer"
  value       = aws_lb.app.dns_name
}

output "database_endpoint" {
  description = "Connection endpoint for the database"
  value       = aws_db_instance.main.endpoint
  sensitive   = true
}

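Apart from the supporting resources omitted here for brevity (security groups, launch template, target group, DB subnet group, and the random_password resource), this module works identically in both Terraform and OpenTofu. With OpenTofu, you could additionally enable the native encryption layer: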

# OpenTofu-specific enhancement
terraform {
  encryption {
    key_provider "aws_kms" "state" {
      kms_key_id = "arn:aws:kms:us-west-2:111122223333:key/1234abcd"
      region     = "us-west-2"
      key_spec   = "AES_256"
    }

    method "aes_gcm" "state_encryption" {
      keys = key_provider.aws_kms.state
    }

    state {
      method = method.aes_gcm.state_encryption
    }
  }
}
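
The subnet layout in the module above depends on cidrsubnet(var.vpc_cidr, 4, count.index). As a worked example of that arithmetic, here is a minimal IPv4-only sketch in JavaScript; it is an illustrative re-implementation, not the code either tool uses.

```javascript
// Illustrative IPv4-only version of cidrsubnet(prefix, newbits, netnum).
function cidrsubnet(prefix, newbits, netnum) {
  const [address, lengthText] = prefix.split('/');
  const baseLength = Number(lengthText);
  const newLength = baseLength + newbits;
  if (newLength > 32) throw new Error('not enough address bits left');
  if (netnum < 0 || netnum >= 2 ** newbits) throw new Error('netnum out of range');

  // Pack the dotted quad into an unsigned 32-bit number.
  const packed = address.split('.').reduce((acc, octet) => acc * 256 + Number(octet), 0);
  // Clear any host bits so the base is aligned to the original prefix.
  const hostSize = 2 ** (32 - baseLength);
  const base = Math.floor(packed / hostSize) * hostSize;
  // Each new subnet spans 2^(32 - newLength) addresses.
  const subnet = base + netnum * 2 ** (32 - newLength);

  const octets = [24, 16, 8, 0].map((shift) => Math.floor(subnet / 2 ** shift) % 256);
  return `${octets.join('.')}/${newLength}`;
}

// Public subnets use indexes 0-2, private subnets use indexes 3-5.
const publicSubnets = [0, 1, 2].map((i) => cidrsubnet('10.0.0.0/16', 4, i));
const privateSubnets = [0, 1, 2].map((i) => cidrsubnet('10.0.0.0/16', 4, i + 3));
console.log(publicSubnets);  // [ '10.0.0.0/20', '10.0.16.0/20', '10.0.32.0/20' ]
console.log(privateSubnets); // [ '10.0.48.0/20', '10.0.64.0/20', '10.0.80.0/20' ]
```

Adding 4 bits to a /16 yields sixteen possible /20 networks, so the "+ 3" offset on the private subnets simply keeps them from overlapping the three public ones.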

Migration: Moving Between Terraform and OpenTofu

One of the most common questions is how difficult it is to migrate between these tools. The good news: it’s relatively straightforward.

Migrating from Terraform to OpenTofu

# Step 1: Install OpenTofu
# Using Homebrew (macOS/Linux)
brew install opentofu

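# Using the official installer script (Ubuntu/Debian)
curl --proto '=https' --tlsv1.2 -fsSL https://get.opentofu.org/install-opentofu.sh -o install-opentofu.sh
chmod +x install-opentofu.sh
./install-opentofu.sh --install-method deb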

# Step 2: Initialize with existing state
cd your-terraform-project
tofu init -upgrade

# Step 3: Verify plan
tofu plan

# Step 4: Apply (if everything looks good)
tofu apply

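In most cases the migration is straightforward because OpenTofu reads existing Terraform state files, though you should back up the state before the first tofu apply. You can switch back to Terraform if needed, but you would first have to remove OpenTofu-specific features such as state encryption.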

Migrating from OpenTofu to Terraform

# Step 1: Remove OpenTofu-specific features
# Comment out or remove encryption blocks and OpenTofu-only syntax

# Step 2: Initialize with Terraform
terraform init -upgrade

# Step 3: Verify and apply
terraform plan
terraform apply

Gradual Migration Strategy

For large organizations, a gradual approach might be preferable:

# Module that works with both tools
module "networking" {
  source = "./modules/networking"

  # Use only compatible features
  vpc_cidr    = "10.0.0.0/16"
  environment = var.environment
}

# encryption.tofu
# OpenTofu 1.8+ loads *.tofu files; Terraform ignores them, so
# OpenTofu-only configuration can live here during the transition.
terraform {
  encryption {
    # OpenTofu-specific encryption config
  }
}

Feature Comparison Matrix

Let’s break down the key differences in a structured comparison:

Core Functionality

Feature	Terraform	OpenTofu	Notes
HCL Syntax	✅	✅	Identical
State Management	✅	✅	Compatible
Provider Ecosystem	✅	✅	OpenTofu working toward independence
Module Support	✅	✅	Fully compatible
Workspaces	✅	✅	Identical functionality
Remote Backends	✅	✅	OpenTofu has additional options

Advanced Features

Feature	Terraform	OpenTofu	Notes
State Encryption	Backend-level	Native client-side	OpenTofu advantage
Testing Framework	✅	✅	Comparable test command in both
For-each with sensitive	⚠️ Limited	✅ Full support	OpenTofu improvement
Removed block	✅ 1.7+	✅ 1.7+	Available in both

Operational Aspects

Aspect	Terraform	OpenTofu	Notes
License	BSL 1.1	MPL 2.0	OpenTofu is truly open source
Governance	HashiCorp	Linux Foundation	Community vs. corporate
Release Cycle	~4 months	~2-3 months	OpenTofu moves faster
Performance	Baseline	10-30% faster	Reported for large deployments; no benchmark cited, so measure on your own workloads
Cloud Provider Support	Excellent	Excellent	Both well-supported

Real-World Use Cases and Recommendations

When to Choose Terraform

Scenario 1: HashiCorp Ecosystem Integration

If you’re heavily invested in HashiCorp products (Vault, Consul, Nomad), Terraform might provide smoother integration:

# Using Terraform with HashiCorp Vault
data "vault_generic_secret" "database" {
  path = "secret/database/${var.environment}"
}

resource "aws_db_instance" "main" {
  # ...
  username = data.vault_generic_secret.database.data["username"]
  password = data.vault_generic_secret.database.data["password"]
}

Scenario 2: Enterprise Support Requirements

Organizations requiring commercial support and SLAs from HashiCorp:

# Terraform Cloud/Enterprise features
terraform {
  cloud {
    organization = "my-company"

    workspaces {
      name = "production-infrastructure"
    }
  }
}

Scenario 3: Risk-Averse Organizations

Companies that prefer stability over innovation and have legal concerns about switching tools.

When to Choose OpenTofu

Scenario 1: Open Source Commitment

Organizations with strong open-source requirements can benefit from OpenTofu’s MPL 2.0 license and community governance model, while still using the same provider ecosystem.

Scenario 2: Security-First Environments

When client-side encryption is a requirement:

# OpenTofu state encryption for compliance
terraform {
  encryption {
    key_provider "pbkdf2" "compliance" {
      passphrase = var.encryption_key
      key_length = 32
      iterations = 600000
    }

    method "aes_gcm" "state" {
      keys = key_provider.pbkdf2.compliance
    }

    state {
      method = method.aes_gcm.state
    }

    plan {
      method = method.aes_gcm.state
    }
  }
}

Scenario 3: Community-Driven Innovation

Organizations that want to influence tool development:

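# Using an OpenTofu-only feature: early evaluation (OpenTofu 1.8+)
# lets variables appear in places such as a module source.
terraform {
  required_version = ">= 1.8"
}

variable "module_ref" {
  type    = string
  default = "v1.0.0"
}

module "networking" {
  # Hypothetical repository URL, for illustration only
  source = "git::https://example.com/infra/networking.git?ref=${var.module_ref}"
}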

Scenario 4: Cost Optimization

Teams looking to avoid potential future licensing costs:

# No vendor lock-in concerns
tofu init
tofu apply
# Free forever, no upgrade pressure

Community and Ecosystem

Community Support

Terraform:

Larger existing community (established 2014)
More Stack Overflow questions and answers
Extensive documentation and tutorials
HashiCorp-led conferences and events

OpenTofu:

Rapidly growing community
Active GitHub discussions and contributions
Linux Foundation backing
More transparent governance process

Provider Availability

Provider declarations are written the same way in both tools:

# Provider configuration works identically in both tools
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }

    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }

    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

OpenTofu maintains its own registry (registry.opentofu.org), which indexes the same upstream provider releases and is what the tofu CLI queries by default, ensuring long-term provider availability. The OpenTofu community is also working on a strategy to fork and maintain critical providers if licensing changes make this necessary, though currently providers work with both tools.

Future Outlook

Terraform’s Direction

HashiCorp is focusing on:

Enterprise features and Terraform Cloud
Improved CDKTF (Cloud Development Kit for Terraform)
Enhanced policy and governance
AI-assisted infrastructure coding

OpenTofu’s Roadmap

The OpenTofu project is prioritizing:

Complete provider registry independence
Enhanced testing and validation features
Improved performance optimizations
Community-driven feature development
Better integration with CI/CD pipelines

Decision Framework

Here’s a practical framework for choosing between these tools:

Decision Tree

Start Here
    │
    ├─ Do you require open source licensing?
    │   ├─ Yes → Consider OpenTofu
    │   └─ No → Continue
    │
    ├─ Do you need HashiCorp enterprise support?
    │   ├─ Yes → Choose Terraform
    │   └─ No → Continue
    │
    ├─ Is state file encryption critical?
    │   ├─ Yes → OpenTofu has advantage
    │   └─ No → Continue
    │
    ├─ Do you want community governance?
    │   ├─ Yes → Choose OpenTofu
    │   └─ No → Either works
    │
    └─ Default → Both are excellent choices
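
The decision tree above can also be read as a short function. This is only a sketch: the property names are hypothetical, and it assumes the questions are evaluated top to bottom with the first "Yes" deciding the outcome.

```javascript
// Hypothetical encoding of the decision tree; the first matching question wins.
function recommendTool(answers = {}) {
  const {
    requiresOpenSourceLicense = false,
    needsHashiCorpSupport = false,
    needsStateEncryption = false,
    wantsCommunityGovernance = false,
  } = answers;

  if (requiresOpenSourceLicense) return 'Consider OpenTofu';
  if (needsHashiCorpSupport) return 'Choose Terraform';
  if (needsStateEncryption) return 'OpenTofu has advantage';
  if (wantsCommunityGovernance) return 'Choose OpenTofu';
  return 'Either works';
}

console.log(recommendTool({ needsHashiCorpSupport: true }));  // Choose Terraform
console.log(recommendTool({ needsStateEncryption: true }));   // OpenTofu has advantage
console.log(recommendTool({}));                               // Either works
```

Because the order of the questions determines the result, a team that needs both an open source license and vendor support would land on OpenTofu here; reorder the checks if your priorities differ.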

Evaluation Checklist

For organizations evaluating their options:

Technical Requirements:

Current Terraform version compatibility
Provider availability and versions needed
State management requirements
Encryption and security needs
Testing framework needs
Specific feature requirements (e.g., state encryption, provider functions)

Organizational Factors:

Open source policy compliance
Budget for commercial support
Risk tolerance for tool switching
Team expertise and training needs
Long-term strategic alignment
Community vs. vendor relationship preference

Operational Considerations:

CI/CD pipeline integration
Existing toolchain compatibility
Migration complexity and effort
Backup and disaster recovery processes
Compliance and audit requirements

Conclusion

Both Terraform and OpenTofu are powerful, production-ready tools for infrastructure as code. The choice between them ultimately depends on your organization’s priorities:

Choose Terraform if:

You require HashiCorp’s commercial support and enterprise features
You’re deeply integrated with the HashiCorp ecosystem
You prefer the stability of an established, well-funded vendor
Your organization is risk-averse about tool changes

Choose OpenTofu if:

Open source licensing is a requirement or strong preference
You want community-driven governance and development
Client-side state encryption is important for your security model
You value independence from vendor licensing changes
You want to support and influence community-driven infrastructure tooling

The Good News: Regardless of your choice, both tools:

Use the same HCL syntax
Maintain state file compatibility
Support the same providers (for now)
Allow relatively easy migration between them

For most teams, OpenTofu represents a safe, forward-looking choice that embraces open source principles while maintaining compatibility with the Terraform ecosystem. However, organizations with specific enterprise needs or HashiCorp relationships may find Terraform continues to serve them well.

The infrastructure as code landscape is healthier for having both options. Competition drives innovation, and the existence of OpenTofu has already influenced Terraform’s development priorities. As users, we benefit from this diversity.

Ultimately, the best approach is to evaluate both tools against your specific requirements, run proof-of-concept projects, and make an informed decision based on your organization’s unique needs. Both paths lead to effective infrastructure automation — the question is which better aligns with your values, requirements, and long-term strategy.

What’s your experience with Terraform and OpenTofu? Have you made the switch, or are you staying with Terraform? Share your thoughts and experiences in the comments below.

Building Your First GitHub Custom Action: A Step-by-Step Guide

Rui Coelho — Sun, 25 Jan 2026 12:56:51 GMT

Why Custom Actions Matter

If you’re using GitHub Actions for CI/CD, you’ve probably noticed yourself writing the same workflow steps over and over. Maybe you’re always checking PR sizes, validating commit messages, or posting notifications to Slack. This repetition is a perfect opportunity for a custom action.

In this guide, I’ll walk you through creating a practical GitHub Custom Action from scratch: a PR Size Checker that automatically labels pull requests based on their size and suggests splitting large PRs.

By the end of this article, you’ll understand:

The anatomy of a GitHub Action
How to build one using JavaScript
How to bundle and publish your action
Best practices for versioning and automation

What We’re Building

Our PR Size Checker will:

Calculate total lines changed in a pull request
Apply labels: small, medium, large, or extra-large
Automatically create these labels if they don’t exist
Comment on oversized PRs suggesting they be split
Be configurable with custom thresholds

This solves a real problem: large PRs slow down code reviews and increase the chance of bugs slipping through. Automated labeling helps teams prioritize reviews and encourages better practices.

Prerequisites

Before we start, you’ll need:

Node.js 20 or higher
A GitHub account
Basic knowledge of JavaScript and GitHub Actions
A repository where you can test your action

Project Structure

Here’s what our final project will look like:

github-custom-action-examples/
├── action.yml              # Action metadata
├── index.js               # Main logic (source code)
├── dist/
│   └── index.js          # Bundled code (commit this!)
├── package.json          # Dependencies
├── package-lock.json
├── README.md
└── .github/
    └── workflows/
        ├── release.yml                # Automated releases
        └── pr-size-check.yml          # Example usage

The key thing to understand: we write code in index.js, but GitHub Actions runs dist/index.js (the bundled version). More on that later.

Step 1: Setting Up the Project

Create a new repository and initialize it:

mkdir pr-size-checker
cd pr-size-checker
npm init -y

Install the required dependencies:

npm install @actions/core @actions/github
npm install --save-dev @vercel/ncc

What are these packages?

@actions/core: Provides functions for inputs, outputs, and logging
@actions/github: Gives access to GitHub API and webhook payload
@vercel/ncc: Bundles your code and dependencies into a single file

Update your package.json with build scripts:

{
  "scripts": {
    "build": "ncc build index.js -o dist"
  }
}

Step 2: Creating the Action Metadata

The action.yml file is your action's configuration. It defines inputs, outputs, and how to run the action.

Create action.yml:

name: 'PR Size Checker'
description: 'Automatically checks Pull Request size and adds appropriate labels'
author: 'AutomationDojo'

branding:
  icon: 'git-pull-request'
  color: 'blue'

inputs:
  github-token:
    description: 'GitHub token for API calls'
    required: true

  small-threshold:
    description: 'Maximum lines changed for a small PR'
    required: false
    default: '100'

  medium-threshold:
    description: 'Maximum lines changed for a medium PR'
    required: false
    default: '300'

  large-threshold:
    description: 'Maximum lines changed for a large PR'
    required: false
    default: '600'

  comment-on-large:
    description: 'Whether to comment on large PRs'
    required: false
    default: 'true'

outputs:
  size-label:
    description: 'Label applied to the PR (small, medium, large, extra-large)'

  lines-changed:
    description: 'Total number of lines changed'

runs:
  using: 'node20'
  main: 'dist/index.js'

Key points:

inputs: Parameters users can configure
outputs: Values your action returns (useful for chaining actions)
runs.main: Points to the bundled file, not the source
branding: How your action appears in the GitHub Marketplace

Step 3: Writing the Action Logic

Now for the core functionality. Create index.js:

const core = require('@actions/core');
const github = require('@actions/github');

async function run() {
  try {
    // Get inputs
    const token = core.getInput('github-token', { required: true });
    const smallThreshold = parseInt(core.getInput('small-threshold'), 10);
    const mediumThreshold = parseInt(core.getInput('medium-threshold'), 10);
    const largeThreshold = parseInt(core.getInput('large-threshold'), 10);
    const commentOnLarge = core.getInput('comment-on-large') === 'true';

    // Initialize GitHub client
    const octokit = github.getOctokit(token);
    const context = github.context;

    // Ensure this is a pull request event
    if (!context.payload.pull_request) {
      core.setFailed('This action only works on pull_request events');
      return;
    }

    const pr = context.payload.pull_request;
    const owner = context.repo.owner;
    const repo = context.repo.repo;
    const prNumber = pr.number;

    // Calculate total lines changed
    const additions = pr.additions || 0;
    const deletions = pr.deletions || 0;
    const totalChanges = additions + deletions;

    core.info(`PR #${prNumber} has ${totalChanges} lines changed`);
    core.info(`Additions: ${additions}, Deletions: ${deletions}`);

    // Determine size label
    let sizeLabel;
    if (totalChanges <= smallThreshold) {
      sizeLabel = 'small';
    } else if (totalChanges <= mediumThreshold) {
      sizeLabel = 'medium';
    } else if (totalChanges <= largeThreshold) {
      sizeLabel = 'large';
    } else {
      sizeLabel = 'extra-large';
    }

    core.info(`Size determined: ${sizeLabel}`);

    // Define label configurations
    const labelConfigs = {
      'small': { color: '0e8a16', description: 'Small PR, easy to review' },
      'medium': { color: 'fbca04', description: 'Medium-sized PR' },
      'large': { color: 'e99695', description: 'Large PR, consider splitting' },
      'extra-large': { color: 'd93f0b', description: 'Very large PR, splitting recommended' }
    };

    // Ensure all size labels exist
    for (const [labelName, config] of Object.entries(labelConfigs)) {
      try {
        await octokit.rest.issues.createLabel({
          owner,
          repo,
          name: labelName,
          color: config.color,
          description: config.description
        });
        core.info(`Created label: ${labelName}`);
      } catch (error) {
        if (error.status === 422) {
          core.info(`Label ${labelName} already exists`);
        } else {
          throw error;
        }
      }
    }

    // Get current labels
    const { data: currentLabels } = await octokit.rest.issues.listLabelsOnIssue({
      owner,
      repo,
      issue_number: prNumber
    });

    // Remove old size labels
    const sizeLabels = ['small', 'medium', 'large', 'extra-large'];
    for (const label of currentLabels) {
      if (sizeLabels.includes(label.name) && label.name !== sizeLabel) {
        await octokit.rest.issues.removeLabel({
          owner,
          repo,
          issue_number: prNumber,
          name: label.name
        });
        core.info(`Removed old label: ${label.name}`);
      }
    }

    // Add new size label
    await octokit.rest.issues.addLabels({
      owner,
      repo,
      issue_number: prNumber,
      labels: [sizeLabel]
    });
    core.info(`Added label: ${sizeLabel}`);

    // Comment on large PRs
    if (commentOnLarge && (sizeLabel === 'large' || sizeLabel === 'extra-large')) {
      const commentBody = `⚠️ **Large Pull Request Detected**

This PR has **${totalChanges} lines changed**. Large PRs can be difficult to review thoroughly and may slow down the development process.

**Consider:**
- Breaking this PR into smaller, focused changes
- Each PR should ideally address a single concern
- Smaller PRs are easier to review, test, and merge

If this PR must remain large, please ensure it has:
- ✅ Comprehensive description
- ✅ Clear testing instructions
- ✅ Appropriate documentation updates`;

      // Check if we already commented
      const { data: comments } = await octokit.rest.issues.listComments({
        owner,
        repo,
        issue_number: prNumber
      });

      const botComment = comments.find(
        comment => comment.user.type === 'Bot' && 
                   comment.body.includes('Large Pull Request Detected')
      );

      if (!botComment) {
        await octokit.rest.issues.createComment({
          owner,
          repo,
          issue_number: prNumber,
          body: commentBody
        });
        core.info('Added comment suggesting PR split');
      } else {
        core.info('Comment already exists, skipping');
      }
    }

    // Set outputs
    core.setOutput('size-label', sizeLabel);
    core.setOutput('lines-changed', totalChanges);

    core.info(`✅ Successfully processed PR #${prNumber}`);

  } catch (error) {
    core.setFailed(`Action failed: ${error.message}`);
  }
}

run();

Let’s break down the key parts:

Reading Inputs

const token = core.getInput('github-token', { required: true });
const smallThreshold = parseInt(core.getInput('small-threshold'), 10);

The core.getInput() function reads values from the workflow file. Users can override defaults you set in action.yml.
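
One thing worth guarding against: parseInt returns NaN for an empty or non-numeric string, and every comparison against NaN is false, so a typo in a threshold would quietly push every PR into extra-large. Here is a minimal sketch of stricter parsing. parseThreshold and assertAscending are hypothetical helpers for illustration, not part of the toolkit or the original action.

```javascript
// Hypothetical helper: accept only plain non-negative integers such as "100".
function parseThreshold(raw, name) {
  const text = String(raw).trim();
  if (!/^\d+$/.test(text)) {
    throw new Error(`Input "${name}" must be a non-negative integer, got "${raw}"`);
  }
  return Number.parseInt(text, 10);
}

// Hypothetical helper: the size buckets only make sense if the
// thresholds are strictly increasing.
function assertAscending(small, medium, large) {
  if (!(small < medium && medium < large)) {
    throw new Error('Thresholds must satisfy small < medium < large');
  }
}
```

Inside the action you could call parseThreshold(core.getInput('small-threshold'), 'small-threshold') within the existing try block, so a bad value ends up in core.setFailed with a readable message.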

Accessing GitHub Context

const octokit = github.getOctokit(token);
const context = github.context;
const pr = context.payload.pull_request;

The github.context object contains information about the workflow run, including the pull request payload with additions, deletions, and other metadata.
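
Because the payload already carries the counts, the size decision can live in a small pure function that is easy to unit test without touching the API. This sketch uses the same comparison logic as the main file. The function names and the shape of the thresholds object are my assumptions for the example.

```javascript
// Sum additions and deletions from a pull_request payload object,
// treating missing fields as zero.
function totalLinesChanged(pullRequest) {
  return (pullRequest.additions || 0) + (pullRequest.deletions || 0);
}

// Map a line count to a size label. Each threshold is inclusive,
// matching the <= comparisons in index.js.
function classifySize(totalChanges, { small, medium, large }) {
  if (totalChanges <= small) return 'small';
  if (totalChanges <= medium) return 'medium';
  if (totalChanges <= large) return 'large';
  return 'extra-large';
}
```

With the default thresholds, a PR with exactly 100 changed lines is still small, and 101 is the first medium one.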

Creating Labels

await octokit.rest.issues.createLabel({
  owner,
  repo,
  name: labelName,
  color: config.color,
  description: config.description
});

We create labels if they don’t exist. The try-catch handles the case where they already exist (422 error).
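
If you want to test that error handling in isolation, one option is to pull it into a helper that receives the API client as an argument. This is only a sketch: ensureLabel is a hypothetical name, client stands for anything shaped like octokit.rest.issues, and it assumes that a 422 on this call means the label already exists.

```javascript
// Hypothetical helper: create a label, treating HTTP 422 as "already there".
// `client` is expected to expose createLabel() like octokit.rest.issues does.
async function ensureLabel(client, { owner, repo, name, color, description }) {
  try {
    await client.createLabel({ owner, repo, name, color, description });
    return 'created';
  } catch (error) {
    if (error.status === 422) {
      return 'exists';
    }
    // Anything else (permissions, rate limits, network) is a real failure.
    throw error;
  }
}
```

In the action you would pass octokit.rest.issues as the client. In a test, a plain object with a stubbed createLabel is enough.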

Managing Labels

// Remove old labels
await octokit.rest.issues.removeLabel({...});

// Add new label
await octokit.rest.issues.addLabels({...});

We remove old size labels before adding the new one to keep things clean.
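
The decision about which labels to remove can also be separated from the API calls. Here is a small sketch, assuming label objects with a name property as returned by listLabelsOnIssue. staleSizeLabels is a name I made up for the example.

```javascript
const SIZE_LABELS = ['small', 'medium', 'large', 'extra-large'];

// Return the names of size labels on the PR that no longer match the
// newly computed size. Non-size labels such as "bug" are left alone.
function staleSizeLabels(currentLabels, newLabel) {
  return currentLabels
    .map((label) => label.name)
    .filter((name) => SIZE_LABELS.includes(name) && name !== newLabel);
}
```

The action would then loop over the returned names and call removeLabel for each one.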

Setting Outputs

core.setOutput('size-label', sizeLabel);
core.setOutput('lines-changed', totalChanges);

Outputs allow other workflow steps to use your action’s results.

Step 4: Bundling Your Code

GitHub Actions doesn’t install node_modules for you. You need to bundle everything into a single file using @vercel/ncc:

npm run build

This creates dist/index.js containing your code and all dependencies. You must commit this file to your repository.

Add to .gitignore:

node_modules/

But don’t ignore dist/—GitHub needs it to run your action.

Step 5: Using Your Action

Create .github/workflows/pr-size-check.yml to test your action:

name: PR Size Check

on:
  pull_request:
    types: [opened, synchronize, reopened]

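jobs:
  check-pr-size:
    runs-on: ubuntu-latest
    name: Check PR Size
    permissions:
      contents: read         # Needed by actions/checkout
      issues: write          # Create and apply labels
      pull-requests: write   # Label and comment on the PR

    steps:
      - name: Check out the repository  # Required so that ./ exists on the runner
        uses: actions/checkout@v4

      - name: Check PR Size
        uses: ./  # Use local action for testing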
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          small-threshold: 100
          medium-threshold: 300
          large-threshold: 600
          comment-on-large: true

For local testing:

uses: ./ runs the action from the current repository, so the job must check the repository out (actions/checkout) before this step
Perfect for development and testing

For production use:

uses: AutomationDojo/github-custom-action-examples@v1.1.0

Step 6: Versioning and Releases

Managing versions manually is tedious. Let’s automate it with Semantic Release.

Install Semantic Release:

npm install --save-dev semantic-release @semantic-release/changelog @semantic-release/git

Create .releaserc.yml:

branches:
  - main

plugins:
  - '@semantic-release/commit-analyzer'
  - '@semantic-release/release-notes-generator'
  - '@semantic-release/changelog'
  - - '@semantic-release/npm'
    - npmPublish: false  # Only bump package.json; do not publish to npm
  - - '@semantic-release/git'
    - assets:
        - CHANGELOG.md
        - package.json
        - package-lock.json
      message: "chore(release): ${nextRelease.version} [skip ci]\n\n${nextRelease.notes}"
  - '@semantic-release/github'

Create .github/workflows/release.yml:

name: Release

on:
  push:
    branches:
      - main

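jobs:
  release:
    runs-on: ubuntu-latest
    permissions:
      contents: write       # Push the release commit and tag, create the release
      issues: write         # Comment on released issues
      pull-requests: write  # Comment on released pull requests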
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
          persist-credentials: false

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm ci

      - name: Build
        run: npm run build

      - name: Release
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: npx semantic-release

Now use Conventional Commits format:

git commit -m "feat: add support for custom label colors"
git commit -m "fix: resolve issue with label removal"
git commit -m "docs: update README with examples"

When you push to main, Semantic Release:

Analyzes commits to determine version bump
Generates CHANGELOG.md
Creates a GitHub release
Updates package.json
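
To make the first step less of a black box, here is a deliberately simplified sketch of how Conventional Commits messages map to a version bump. It is an illustration of the convention, not Semantic Release's actual implementation. The real analyzer is configurable, and its behavior depends on the preset you use. bumpForCommit and nextBump are names I made up.

```javascript
const RANK = { none: 0, patch: 1, minor: 2, major: 3 };

// Classify a single commit message by its Conventional Commits header.
function bumpForCommit(message) {
  const header = message.split('\n')[0];
  const match = /^(\w+)(\([^)]*\))?(!)?:/.exec(header);
  if (!match) return 'none';
  const [, type, , bang] = match;
  // A "!" after the type/scope or a BREAKING CHANGE footer means major.
  if (bang || /^BREAKING CHANGE:/m.test(message)) return 'major';
  if (type === 'feat') return 'minor';
  if (type === 'fix' || type === 'perf') return 'patch';
  return 'none'; // docs, chore, etc. do not trigger a release here
}

// The release bump is the highest bump among all commits.
function nextBump(messages) {
  return messages
    .map(bumpForCommit)
    .reduce((best, bump) => (RANK[bump] > RANK[best] ? bump : best), 'none');
}
```

For the three example commits above (one feat, one fix, one docs), this sketch yields a minor bump.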

Step 7: Writing Good Documentation

Your README should include:

# PR Size Checker

A GitHub Action that automatically labels PRs based on size.

## Usage

```yaml
- uses: AutomationDojo/github-custom-action-examples@v1.1.0
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    small-threshold: 100
```

## Inputs

| Input | Description | Required | Default |
|-------|-------------|----------|---------|
| github-token | GitHub token | Yes | - |
| small-threshold | Max lines for small PR | No | 100 |

## Outputs

| Output | Description |
|--------|-------------|
| size-label | Applied label |
| lines-changed | Total lines changed |

## Example

```yaml
- id: check-size
  uses: AutomationDojo/github-custom-action-examples@v1.1.0
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}

- run: echo "Size: ${{ steps.check-size.outputs.size-label }}"
```

Examples

You can see this action working on the repo:

You can check the following pull request: https://github.com/AutomationDojo/github-custom-action-examples/pull/1

Best Practices I Learned

1. Always Bundle Your Code

Don’t rely on npm install during action execution. Bundle with ncc and commit dist/.

2. Use Semantic Versioning

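Users should be able to pin to a major tag such as @v1 for automatic updates or to an exact tag such as @v1.1.0 for stability. The major tag is a convention you maintain yourself: it has to be moved to each new release, which the Semantic Release setup above does not do on its own.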

3. Validate Inputs Early

if (!context.payload.pull_request) {
  core.setFailed('This action only works on pull_request events');
  return;
}

4. Provide Useful Logging

core.info(`PR #${prNumber} has ${totalChanges} lines changed`);

Users can see this in their workflow logs for debugging.

5. Handle Errors Gracefully

try {
  // Create label
} catch (error) {
  if (error.status === 422) {
    core.info('Label already exists');
  } else {
    throw error;
  }
}

6. Make Everything Configurable

Don’t hardcode values. Use inputs with sensible defaults.

7. Test Locally First

Use uses: ./ in a workflow within your action's repository before publishing.

Common Pitfalls to Avoid

1. Forgetting to Build

Always run npm run build before committing. GitHub Actions runs dist/index.js, not your source code.

2. Not Committing dist/

The dist/ folder must be in your repository. Don't add it to .gitignore.

3. Wrong Node Version

Set runs.using to node20 in action.yml, and build and test with the same Node.js major version locally and in CI.

4. Missing Permissions

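Ensure github-token has the required permissions. ${{ secrets.GITHUB_TOKEN }} works for this action as long as the job grants it issues: write and pull-requests: write. Where the default token is read-only, and on pull requests from forks, the label and comment calls will typically fail with a 403 (“Resource not accessible by integration”).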

5. Not Handling Edge Cases

What if the PR has 0 changes? What if labels already exist? Handle all scenarios.

Taking It Further

Now that you have a working action, consider:

Adding Tests:

npm install --save-dev jest @types/node

Local Testing with act:

brew install act
act pull_request

Multiple Actions in One Repo: Create subdirectories for different actions with their own action.yml files.

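Publishing to Marketplace: Listing is not automatic. Make the repository public, keep a single action.yml at the repository root with a unique name, accept the GitHub Marketplace Developer Agreement, then draft a release and select “Publish this Action to the GitHub Marketplace”. Repository topics are optional.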

Real-World Impact

After implementing this action in my team:

Code reviews became 30% faster (small PRs are easier to review)
PR sizes decreased by 40% on average
Developers became more conscious of keeping changes focused
Onboarding new team members was easier (labels provide context)

Conclusion

Creating a GitHub Custom Action isn’t as daunting as it seems. With JavaScript and the GitHub Actions toolkit, you can automate almost any workflow task.

The key steps are:

Define your action’s metadata in action.yml
Write the logic using @actions/core and @actions/github
Bundle your code with @vercel/ncc
Test locally with uses: ./
Automate releases with Semantic Release
Document thoroughly

Start small, solve real problems, and iterate. Your future self (and your team) will thank you.

What automation challenges are you facing in your workflows? Share in the comments — I’d love to hear about them!

Managing GitHub Organizations with Terraform: From Manual Chaos to Infrastructure as Code

Rui Coelho — Sun, 25 Jan 2026 12:53:48 GMT

If you’ve ever managed a GitHub organization with more than a handful of repositories, you know the pain. Click here to add a branch protection rule. Click there to create a team. Navigate through five menus to grant repository access. Repeat. Repeat. Repeat.

Now imagine doing this for 50 repositories. Or 100. Or trying to maintain consistency across them all. Or worse — auditing who has access to what.

There’s a better way: Infrastructure as Code with Terraform.

The Problem with Manual GitHub Management

Manual GitHub organization management doesn’t scale. Here’s what typically happens:

Configuration Drift: Team A protects their main branch with certain rules. Team B uses different rules. Team C forgets to protect their branch at all.

Access Control Chaos: Someone needs access to five repositories. You grant it manually. Six months later, they leave the company. Did you remember to revoke access everywhere?

No Audit Trail: How do you know what changed, when, and by whom? GitHub’s audit log helps, but it doesn’t tell you the desired state of your infrastructure.

Documentation Debt: Your internal wiki has outdated screenshots of “how to configure a repository.” Reality diverged months ago.

Enter Terraform for GitHub

Terraform, the popular Infrastructure as Code tool, has excellent support for GitHub through the official GitHub provider. This means you can define your entire organization structure — repositories, teams, access controls, branch protection rules — in code.

The benefits are immediate:

✅ Version Control: Your GitHub configuration lives in Git (meta, right?)
✅ Code Review: Changes go through pull requests
✅ Auditability: Complete history of what changed and why
✅ Consistency: Define patterns once, apply everywhere
✅ Disaster Recovery: Your org structure is documented in code

A Practical Architecture

I’ve built a reference implementation that demonstrates how to structure Terraform for GitHub management: github-org-management-examples

The architecture uses four independent modules:

1. Organization Configuration

Manages org-level settings like billing email, member privileges, and default permissions. This is your organization’s “constitution” — the baseline rules everyone operates under.

resource "github_organization_settings" "this" {
  billing_email = var.billing_email

  members_can_create_repositories = true
  members_can_create_public_repositories = false
  members_can_create_private_repositories = true

  members_can_fork_private_repositories = false
}

2. Repository Management

Here’s where it gets interesting. Instead of hardcoding repositories in Terraform, define them in YAML:

repositories:
  example-with-ruleset:
    name: "example-with-ruleset"
    description: "Example repository with rulesets"
    visibility: "public"
    has_issues: true
    has_discussions: true
    has_projects: false
    delete_branch_on_merge: true
    topics:
      - "terraform"
      - "github"
      - "rulesets"
    vulnerability_alerts: true
    default_branch: "main"

    # Repository Rulesets (available for public repos on free tier)
    rulesets:
      main-protection:
        name: "Main Branch Protection"
        enforcement: "active"
        target: "branch"
        branch_patterns:
          - "~DEFAULT_BRANCH"  # Matches the default branch

        rules:
          creation: false
          update: true  # Restrict updates to actors with bypass permission
          deletion: true  # Block deletion
          required_linear_history: true
          non_fast_forward: true  # Prevent force pushes

          pull_request:
            required_approving_review_count: 1
            dismiss_stale_reviews_on_push: true
            require_code_owner_review: false
            required_review_thread_resolution: true

          required_status_checks:
            strict_required_status_checks_policy: true
            required_checks: []

The Terraform code reads this YAML and creates resources dynamically. This separation is crucial — developers can propose repository changes in YAML without touching Terraform logic.

3. Organization Rulesets

Organization-level rulesets (requires GitHub Team or Enterprise) let you enforce policies across all repositories. Think of it as a safety net — even if someone forgets to configure their repository properly, the org-level rules catch it.

Important: Repository-level rulesets work on the free tier for public repos. Organization-level rulesets require a paid plan but provide centralized enforcement.

4. Team Management

Teams and their repository access permissions, all in YAML:

teams:
  core-team:
    name: "Core Team"
    description: "Core maintainers with full access to the organization"
    privacy: "closed"
    members:
      - username: "alice"
        role: "maintainer"
      - username: "bob"
        role: "member"
    repositories:
      - repository: ".github"
        permission: "admin"

  external-access:
    name: "External Access"
    description: "External collaborators with read access to specific private repositories"
    privacy: "closed"
    members:
      - username: "external-contractor"
        role: "member"
    repositories:
      - repository: "private-repo"
        permission: "pull"  # Read-only access
      - repository: "another-private-repo"
        permission: "push"  # Write access

The Terraform module handles the complexity of creating teams, adding members, and granting repository access — all from this declarative configuration.

Real-World Workflow

Here’s how this looks in practice:

Scenario: New Repository

Developer opens a PR adding the repository to repositories.yaml
Team reviews the configuration (visibility, branch protection, etc.)
PR merges
GitHub Actions runs terraform apply
Repository is created with all protections in place

Scenario: Access Request

Engineer needs access to three repositories
PR adds them to the relevant team in teams.yaml
Security team reviews
Merge triggers Terraform
Access granted consistently across all repos

Scenario: Policy Update

Security requires all repos to enforce signed commits
Update the organization ruleset in org_rulesets.yaml
One PR, one review, one apply
Policy enforced across the entire organization

The YAML Strategy

Why YAML over pure Terraform? Several reasons:

Lower Barrier to Entry: Developers who don’t know Terraform can still propose repository changes. YAML is more approachable than HCL.

Separation of Concerns: Terraform handles the how (API calls, state management). YAML handles the what (desired configuration).

Validation: You can build additional tooling around YAML — linters, validators, custom checks — without modifying Terraform code.

Scalability: When you have 100+ repositories, managing them in YAML is far more maintainable than sprawling Terraform files.
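
As an example of the validation point, here is a minimal sketch of a pre-merge check for the teams file. It assumes the YAML has already been parsed into a plain object by whatever YAML library you use, and that only the built-in roles and permissions are allowed. Custom repository roles would have to be added to the set. validateTeams is a hypothetical name, not something from the reference repository.

```javascript
const TEAM_ROLES = new Set(['member', 'maintainer']);
const REPO_PERMISSIONS = new Set(['pull', 'triage', 'push', 'maintain', 'admin']);

// Return a list of human-readable problems; an empty list means the
// parsed teams configuration passed these (illustrative) checks.
function validateTeams(config) {
  const errors = [];
  for (const [key, team] of Object.entries(config.teams || {})) {
    if (!team.name) errors.push(`${key}: missing name`);
    for (const member of team.members || []) {
      const role = member.role || 'member'; // same default as the module
      if (!member.username) errors.push(`${key}: member without username`);
      if (!TEAM_ROLES.has(role)) errors.push(`${key}: invalid role "${role}"`);
    }
    for (const repo of team.repositories || []) {
      const permission = repo.permission || 'pull'; // same default as the module
      if (!repo.repository) errors.push(`${key}: repository entry without name`);
      if (!REPO_PERMISSIONS.has(permission)) {
        errors.push(`${key}: invalid permission "${permission}"`);
      }
    }
  }
  return errors;
}
```

Run as a CI step on pull requests, a check like this gives contributors feedback before Terraform ever plans anything.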

Key Implementation Details

Using `try()` for Optional Fields

GitHub’s API has many optional parameters. The modules use try() extensively to provide sensible defaults:

resource "github_repository" "repos" {
  for_each = local.repositories

  name        = each.value.name
  description = try(each.value.description, null)
  visibility  = try(each.value.visibility, "private")

  # Features
  has_issues      = try(each.value.has_issues, true)
  has_discussions = try(each.value.has_discussions, false)
  has_projects    = try(each.value.has_projects, true)
  has_wiki        = try(each.value.has_wiki, true)

  # Merge settings
  allow_merge_commit     = try(each.value.allow_merge_commit, true)
  allow_squash_merge     = try(each.value.allow_squash_merge, true)
  delete_branch_on_merge = try(each.value.delete_branch_on_merge, true)

  # Other settings
  topics               = try(each.value.topics, [])
  vulnerability_alerts = try(each.value.vulnerability_alerts, true)
}

This pattern allows YAML configurations to be minimal — only specify what differs from defaults.

Dynamic Ruleset Generation

Repository rulesets can be defined inline with each repository. The locals.tf flattens this structure:

# locals.tf - Flatten rulesets from all repositories
locals {
  repo_rulesets = flatten([
    for repo_key, repo in local.repositories : [
      for ruleset_key, ruleset in try(repo.rulesets, {}) : {
        key         = "${repo_key}-${ruleset_key}"
        repo_key    = repo_key
        repo_name   = repo.name
        ruleset_key = ruleset_key
        ruleset     = ruleset
      }
    ]
  ])
}

# main.tf - Create rulesets dynamically
resource "github_repository_ruleset" "repo_rulesets" {
  for_each = {
    for rs in local.repo_rulesets : rs.key => rs
  }

  repository  = github_repository.repos[each.value.repo_key].name
  name        = each.value.ruleset.name
  target      = try(each.value.ruleset.target, "branch")
  enforcement = try(each.value.ruleset.enforcement, "active")

  conditions {
    ref_name {
      include = try(each.value.ruleset.branch_patterns, ["~DEFAULT_BRANCH"])
      exclude = try(each.value.ruleset.exclude_patterns, [])
    }
  }

  rules {
    creation                = try(each.value.ruleset.rules.creation, false)
    update                  = try(each.value.ruleset.rules.update, true)
    deletion                = try(each.value.ruleset.rules.deletion, true)
    required_linear_history = try(each.value.ruleset.rules.required_linear_history, false)
    non_fast_forward        = try(each.value.ruleset.rules.non_fast_forward, true)

    dynamic "pull_request" {
      for_each = try(each.value.ruleset.rules.pull_request, null) != null ? [1] : []
      content {
        required_approving_review_count   = try(each.value.ruleset.rules.pull_request.required_approving_review_count, 1)
        dismiss_stale_reviews_on_push     = try(each.value.ruleset.rules.pull_request.dismiss_stale_reviews_on_push, true)
        require_code_owner_review         = try(each.value.ruleset.rules.pull_request.require_code_owner_review, false)
        required_review_thread_resolution = try(each.value.ruleset.rules.pull_request.required_review_thread_resolution, false)
      }
    }
  }
}

This creates rulesets only for repositories that define them, keeping the state clean and focused.

Flattened Team Repository Access

The team module uses a clever flattening technique to create individual access resources:

# locals.tf - Flatten team repositories
locals {
  team_repositories = flatten([
    for team_key, team in local.teams : [
      for repo in coalesce(try(team.repositories, null), []) : {
        team_key   = team_key
        repository = repo.repository
        permission = try(repo.permission, "pull")
      }
    ]
  ])
}

# locals.tf - Flatten team members
locals {
  team_members = flatten([
    for team_key, team in local.teams : [
      for member in coalesce(try(team.members, null), []) : {
        team_key = team_key
        username = member.username
        role     = try(member.role, "member")
      }
    ]
  ])
}

# main.tf - Create team repository access
resource "github_team_repository" "team_repos" {
  for_each = {
    for tr in local.team_repositories : "${tr.team_key}-${tr.repository}" => tr
  }

  team_id    = github_team.teams[each.value.team_key].id
  repository = each.value.repository
  permission = each.value.permission
}

# main.tf - Add team members
resource "github_team_membership" "members" {
  for_each = {
    for tm in local.team_members : "${tm.team_key}-${tm.username}" => tm
  }

  team_id  = github_team.teams[each.value.team_key].id
  username = each.value.username
  role     = each.value.role
}

This transforms the hierarchical YAML structure into the flat resource model Terraform needs.
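
If HCL comprehensions are hard to read, the same transformation can be sketched in JavaScript. This mirrors the team_repositories local and the "${team_key}-${repository}" key used in for_each. It is an illustration only: the function names are mine and nothing here is part of the Terraform module.

```javascript
// Turn { teamKey: { repositories: [...] } } into one flat entry per
// team/repository pair, with the same default permission as the HCL.
function flattenTeamRepositories(teams) {
  return Object.entries(teams).flatMap(([teamKey, team]) =>
    (team.repositories ?? []).map((repo) => ({
      key: `${teamKey}-${repo.repository}`,
      team_key: teamKey,
      repository: repo.repository,
      permission: repo.permission ?? 'pull',
    }))
  );
}

// Report keys that occur more than once. Terraform rejects duplicate keys
// when building the for_each map, so these are worth catching early.
function duplicateKeys(entries) {
  const seen = new Set();
  const dupes = new Set();
  for (const { key } of entries) {
    if (seen.has(key)) dupes.add(key);
    seen.add(key);
  }
  return [...dupes];
}
```

One design detail this makes visible: because the key is a plain string join, a team "a-b" with repository "c" and a team "a" with repository "b-c" both produce "a-b-c". A separator that cannot appear in either name avoids that collision.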

Deployment Strategy

Each module is independent, allowing incremental adoption:

Start Small: Begin with organization settings
Add Repositories: Migrate existing repos to Terraform gradually
Implement Teams: Codify team structure and access
Enforce Policies: Layer in rulesets once the foundation is solid

Use separate state files for each module. This provides isolation — changes to teams don’t affect repository state.

Gotchas and Considerations

Authentication

You’ll need either a Personal Access Token (PAT) or GitHub App credentials. For production, use GitHub Apps with fine-grained permissions.

provider "github" {
  owner = var.github_organization
  token = var.github_token  # Better: use app authentication
}

State Management

Terraform state contains sensitive information. Use remote state (Terraform Cloud, S3 with encryption, etc.) and restrict access appropriately.

Import Existing Resources

Migrating existing infrastructure requires importing resources:

terraform import 'github_repository.repos["backend-api"]' backend-api
terraform import 'github_team.teams["backend-team"]' 12345678

Build an import script if you have many resources to migrate.

Plan Limitations

Organization rulesets require GitHub Team or Enterprise. Repository rulesets work on free tier for public repos. Plan accordingly based on your GitHub tier.

CI/CD Integration

Atlantis: The Recommended Approach

While GitHub Actions works well, Atlantis is often the better choice for Terraform automation — and arguably the recommended approach for managing infrastructure changes. Atlantis provides a GitOps workflow where Terraform runs are triggered and reviewed directly in pull requests, with built-in locking, approval workflows, and plan/apply separation.

The benefits of Atlantis include:

Pull request-native workflow — Plans and applies happen in PR comments
State locking — Prevents concurrent modifications
Approval gates — Require explicit approval before apply
Audit trail — Everything happens in GitHub, fully visible
Multi-environment support — Manage dev/staging/prod with different approval rules

For a production setup managing critical GitHub infrastructure, Atlantis provides the guardrails and visibility you need. The setup requires running an Atlantis server, but the operational benefits are well worth it.

GitHub Actions Approach

Automate Terraform runs with GitHub Actions:

name: Terraform Apply

on:
  push:
    branches: [main]
    paths:
      - 'repos/**'
      - 'teams/**'

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init
        working-directory: ./repos

      - name: Terraform Apply
        run: terraform apply -auto-approve
        env:
          GITHUB_TOKEN: ${{ secrets.TF_GITHUB_TOKEN }}
        working-directory: ./repos

Add terraform plan on pull requests for preview:

on:
  pull_request:
    paths:
      - 'repos/**'

# ... terraform plan output as PR comment

Security Considerations

Least Privilege: Grant Terraform only the permissions it needs. Use GitHub Apps over PATs for fine-grained control.

Secret Scanning: Enable secret scanning on your Terraform repository. Never commit tokens.

State File Security: Terraform state contains sensitive data. Encrypt it at rest and in transit.

Review Process: Require multiple approvals for Terraform PRs. Organization changes should never be one person’s decision.

Drift Detection: Run terraform plan regularly to detect manual changes. Set up alerts if drift is detected.
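
For the drift detection point, terraform plan -detailed-exitcode is handy because it exits with 0 when nothing would change, 1 on error, and 2 when changes are pending. Here is a minimal sketch of a scheduled check built on that. The function names and status strings are my own, and it assumes terraform is on the PATH and terraform init has already been run in the module directory.

```javascript
const { spawnSync } = require('node:child_process');

// Map the exit code of `terraform plan -detailed-exitcode` to a status.
// 0 = no changes, 2 = changes present, anything else (including 1) = error.
function interpretPlanExitCode(code) {
  if (code === 0) return 'in-sync';
  if (code === 2) return 'drift';
  return 'error';
}

// Run the plan in one module directory and report its status.
function checkDrift(workingDirectory) {
  const result = spawnSync(
    'terraform',
    ['plan', '-detailed-exitcode', '-input=false'],
    { cwd: workingDirectory, stdio: 'inherit' }
  );
  // result.status is null if the process could not start or was killed,
  // which interpretPlanExitCode reports as 'error'.
  return interpretPlanExitCode(result.status);
}
```

A scheduled job could call checkDrift for each module directory and raise an alert whenever the result is anything other than in-sync.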

Beyond the Basics

Once you have the foundation, you can extend it:

Custom Modules: Create organization-specific abstractions
Validation: Build custom validators for YAML configurations
Documentation Generation: Auto-generate docs from Terraform state
Compliance Reports: Generate access reports for audits
Batch Operations: Bulk update repositories using Terraform’s for_each

The Full Picture

Managing GitHub with Terraform isn’t just about automation — it’s about treating your organization structure as code. Version control, code review, automated testing, and deployment pipelines all apply.

The result is a GitHub organization that’s:

Consistent: Every repository follows the same standards
Auditable: Complete history of every change
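Recoverable: Settings, teams, and access rules can be recreated with terraform apply (repository contents such as Git history and issues still need their own backups)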
Scalable: Adding your 100th repository is as easy as the first
Secure: Policies are enforced automatically, not manually

Getting Started

Check out the complete implementation: github-org-management-examples

The repository includes:

Four modular Terraform configurations
YAML-based configuration examples
Detailed README with usage instructions
Documentation website: github-org-management-examples.automationdojo.org

Start with one module, validate the approach, then expand. Your future self — and your team — will thank you.

Have you managed GitHub organizations with Terraform? What challenges did you face? Share your experience in the comments.

Docker Hardened Images: Enterprise Security, Now Free for Everyone

Rui Coelho — Sun, 25 Jan 2026 12:44:58 GMT

How Docker’s security-focused container images went from premium to community-accessible.

When it comes to container security, the old saying “you don’t know what you don’t know” has never been more relevant. Every Docker image you pull could be hiding vulnerabilities, unnecessary packages, or worse — malicious code. This is where Docker Hardened Images come in, and as of December 2025, they’re available to everyone at no cost.

The Problem with Traditional Container Images

Most Docker images are built for convenience, not security. They come packed with shells, package managers, and tools that make development easy but also expand the attack surface dramatically. It’s like leaving all your house windows open because you might need the fresh air — convenient, yes, but hardly secure.

Consider this: a typical Node.js base image might contain hundreds of packages you’ll never use. Each one is a potential vulnerability. Each one could be the entry point for a supply chain attack. And with the rise of sophisticated attacks targeting containerized applications, this isn’t just theoretical risk — it’s a clear and present danger.

Enter Docker Hardened Images

Docker Hardened Images (DHI) take a radically different approach. Built on minimal Alpine or Debian Linux bases, these images strip away everything that isn’t absolutely necessary:

No shell — Can’t exploit what isn’t there
No package manager — Eliminates an entire class of attacks
Non-root user by default — Limits damage from compromises
Minimal dependencies — Only what your application actually needs

The result? Docker claims up to a 95% reduction in attack surface compared to standard images. That’s not a typo — ninety-five percent.

From Premium to Free: The Journey

Docker first introduced Hardened Images in May 2025 as a commercial offering. The value proposition was clear: pay for enterprise-grade security and compliance. Organizations with strict requirements (those needing FIPS compliance, DoD STIG standards, or contractual SLAs for vulnerability patching) found real value in the premium tier.

But Docker recognized a larger opportunity. Making basic hardened images free could help secure the entire container ecosystem, not just enterprises with deep pockets. As supply chain attacks become increasingly sophisticated, raising the security baseline for everyone benefits the entire community.

What’s Free, What’s Not

The newly free tier includes:

✅ Complete catalog of hardened base images
✅ Full SBOM (Software Bill of Materials) for each image
✅ CVE assessment and vulnerability data
✅ Apache 2.0 license with no hidden surprises
✅ Community support and GitHub-based catalog

The Enterprise tier (still paid) adds:

FIPS 140–2 and DoD STIG compliance variants
7-day critical CVE remediation SLA
Custom image building with full provenance
Enterprise support and contractual guarantees

This tiered approach allows Docker to sustain the project financially while democratizing container security fundamentals.

The Trade-offs You Should Know

Hardened images aren’t a drop-in replacement. The security benefits come with operational changes:

1. No Shell = Different Debugging

Without a shell, you can’t just docker exec into a container and poke around. Docker's solution is Docker Debug, a tool that provides debugging capabilities without modifying the hardened image. The catch? It requires Docker Desktop, which means a subscription for most business uses.

2. Package Installation Requires Workflow Changes

Need additional PHP extensions? You’ll use a -dev variant to install them, then copy the artifacts to your runtime image. It's more steps, but it enforces a clean separation between build-time and runtime dependencies.

3. Modifications Can Undermine Security

You can add anything to a hardened image — Docker won’t stop you. But every addition potentially reduces security. This is where scanners like Docker Scout, Trivy, or Grype become essential for verifying your final image maintains security standards.

Getting Started

Pulling a hardened image is straightforward:

docker pull dhi.io/node:20-alpine3.22

The full catalog is available on Docker Hub, with definitions and documentation on GitHub. The community is already actively requesting new images and variants.

The Community Response: Cautiously Optimistic

The developer community’s reaction has been positive but measured. On Hacker News, several developers pointed to Docker’s history of converting free offerings into paid subscriptions. Docker registries, Docker Desktop — both started free before requiring payment in business contexts.

Some expressed concern about long-term sustainability, drawing parallels to Bitnami’s recent shift from free public images to $50,000+ annual subscriptions following Broadcom’s VMware acquisition.

Docker’s response? The enterprise tier makes the free tier sustainable. Companies needing continuous patching, compliance certifications, and contractual SLAs generate revenue that supports free community access.

Is This the Right Move?

Time will tell if Docker’s strategy succeeds long-term, but the immediate impact is undeniable: container security best practices are now accessible to individual developers, startups, and small teams who couldn’t justify enterprise pricing.

The broader question isn’t whether to use hardened images — the security benefits are too significant to ignore. Rather, it’s about understanding the operational trade-offs and building workflows that embrace security-first principles without sacrificing development velocity.

Making the Switch

If you’re considering hardened images, start with these steps:

Audit your current images — Run a scanner like Docker Scout to understand your current vulnerability exposure
Start with one service — Don’t try to convert everything at once
Adapt your debugging workflow — Invest in Docker Debug or alternative tools early
Automate scanning — Make vulnerability scanning part of your CI/CD pipeline
Document the differences — Your team needs to understand the constraints and workflows
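
The "automate scanning" step can be sketched as a small CI gate. This is a minimal sketch, assuming Trivy is the scanner in use; the function name scan_gate and the SCANNER override variable are illustrative names, not part of any Docker tooling. It relies on Trivy's --exit-code and --severity options so that findings fail the job.

```shell
# scan_gate: fail the pipeline when the scanner reports findings at the given severities.
# Arguments: image reference, optional comma-separated severities.
# SCANNER is a hypothetical override so the gate can be exercised without a real scanner.
scan_gate() {
  image="$1"
  severities="${2:-HIGH,CRITICAL}"
  # With --exit-code 1, trivy exits non-zero when matching vulnerabilities are found
  if "${SCANNER:-trivy}" image --exit-code 1 --severity "$severities" "$image"; then
    echo "PASS: $image"
  else
    echo "FAIL: $image has $severities findings (or the scan itself failed)"
    return 1
  fi
}
```

In a pipeline step this could be called as `scan_gate dhi.io/node:20-alpine3.22 || exit 1`, so that a failing scan stops the build before the image is pushed.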

The Bigger Picture

Docker Hardened Images represent a maturation of container security. We’re moving beyond “shift left” buzzwords toward practical, opinionated solutions that make secure defaults easy to adopt.

Whether this particular offering remains free indefinitely is secondary to the broader shift: security is becoming less of a premium feature and more of a baseline expectation. And that’s something worth celebrating.

The Docker Hardened Images catalog is available at https://github.com/docker-hardened-images/catalog. Enterprise information is available through Docker’s sales team.

What’s your experience with container security? Share your thoughts in the comments below.

Kubernetes v1.35 (Timbernetes): What’s New and What’s Changing

Rui Coelho — Sun, 25 Jan 2026 12:43:08 GMT

The Kubernetes project released version 1.35 on December 17, 2025, bringing significant enhancements, important deprecations, and a continued focus on stability and enterprise readiness. Building on the 58 enhancements delivered in the v1.34 release cycle, the community continues to push the boundaries of container orchestration while maintaining the platform’s reliability and production-grade quality.

This release represents another milestone in Kubernetes’ evolution, with critical features graduating to general availability and legacy components being phased out to reduce technical debt. Let’s dive deep into what v1.35 brings to the table, with practical examples of how these changes will impact your day-to-day operations.

Game-Changing Features

1. In-Place Pod Resource Updates: Finally GA!

After years of development and testing (alpha in v1.27, beta in v1.33), the ability to update Pod resources without restarting containers has graduated to General Availability in v1.35. This is arguably one of the most requested features in Kubernetes history.

The Problem It Solves

Previously, if you needed to adjust CPU or memory allocations for a running Pod, your only option was to delete and recreate it. This caused disruption to:

Stateful applications that maintain long-lived connections
Machine learning training jobs that couldn’t checkpoint their state
Database replicas that needed to resynchronize data
Any workload where downtime equals lost revenue

How It Works

The feature allows you to modify the resources.requests and resources.limits for containers in a running Pod. Here's a practical example:

# Original Pod specification
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:1.0
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
      limits:
        memory: "1Gi"
        cpu: "1000m"

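Now you can update it in place. Since v1.33, resource changes to a running Pod go through the resize subresource:

kubectl patch pod my-app --subresource resize --type='json' -p='[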
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/requests/memory",
    "value": "1Gi"
  },
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "2Gi"
  }
]'

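With the default resize policy (NotRequired), the container keeps running with its new resource allocation: no restart, no data loss, no service interruption. The kubelet can still defer or reject a resize when the node lacks the capacity for it.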
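
Hand-writing the JSON Patch gets error-prone once several Pods or containers are involved. The helper below is a minimal sketch that builds the same two-operation patch from arguments; the function name resize_patch is illustrative, and it only covers the memory request and limit shown above.

```shell
# resize_patch: print a JSON Patch that replaces the memory request and limit
# of one container. Arguments: container index, new request, new limit.
resize_patch() {
  idx="$1"; request="$2"; limit="$3"
  printf '[{"op":"replace","path":"/spec/containers/%s/resources/requests/memory","value":"%s"},' "$idx" "$request"
  printf '{"op":"replace","path":"/spec/containers/%s/resources/limits/memory","value":"%s"}]\n' "$idx" "$limit"
}
```

It can then be passed to kubectl, for example `kubectl patch pod my-app --subresource resize --type=json -p "$(resize_patch 0 1Gi 2Gi)"`. Comparing `kubectl get pod my-app -o jsonpath='{.status.containerStatuses[0].restartCount}'` before and after is a simple way to confirm that the container was not restarted.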

Real-World Use Case: E-commerce Flash Sale

Imagine you’re running an e-commerce platform. During normal operations, your checkout service runs with 2GB of memory. But you’ve scheduled a flash sale, and you know traffic will spike 10x.

Before v1.35, you had two bad options:

Over-provision all the time (expensive)
Scale out more replicas and accept some disruption during Pod replacements

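With in-place updates, you resize the running Pods directly. Patching the Deployment's Pod template instead would trigger a rolling update and replace the Pods. The loop below assumes the Pods carry the label app=checkout:

# Before the flash sale
for pod in $(kubectl get pods -l app=checkout -o name); do
  kubectl patch "$pod" --subresource resize --type='json' -p='[
    {
      "op": "replace",
      "path": "/spec/containers/0/resources/limits/memory",
      "value": "8Gi"
    }
  ]'
done

# After the flash sale
for pod in $(kubectl get pods -l app=checkout -o name); do
  kubectl patch "$pod" --subresource resize --type='json' -p='[
    {
      "op": "replace",
      "path": "/spec/containers/0/resources/limits/memory",
      "value": "2Gi"
    }
  ]'
done

Your Pods scale vertically without a restart, maintaining all active shopping carts and sessions. Pods created later by the Deployment still start with the values in its template.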

Integration with VPA

Combined with the Vertical Pod Autoscaler, this enables truly dynamic resource optimization:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "InPlaceOrRecreate"  # Resize in place, recreate only as a fallback
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi

With a VPA release that supports the InPlaceOrRecreate mode, the VPA tries to adjust resources in place and only recreates the Pod when an in-place resize is not possible, learning from actual usage patterns and optimizing costs automatically.

2. Node Declared Features: Solving the Version Skew Problem

One of the most challenging aspects of managing Kubernetes clusters at scale is handling version skew during upgrades. You upgrade your control plane to v1.35, but some worker nodes are still on v1.34 or even v1.33. What happens when you schedule a Pod that uses a v1.35 feature on a v1.34 node? Typically: failure.

The Traditional Approach (Manual Labels)

Until now, the solution was manual node labeling:

# Manually label nodes that support new features
kubectl label node worker-1 feature.kubernetes.io/in-place-resize=true
kubectl label node worker-2 feature.kubernetes.io/pod-certificates=true

# Then use node selectors
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  nodeSelector:
    feature.kubernetes.io/in-place-resize: "true"
  containers:
  - name: app
    image: my-app:1.0

This is error-prone, doesn’t scale, and creates operational overhead.

The v1.35 Solution: Automatic Feature Declaration

With Node Declared Features (alpha), nodes automatically report their capabilities:

kubectl get node worker-1 -o jsonpath='{.status.declaredFeatures}'

Illustrative output (the feature is alpha, so the exact shape of this field may differ in your cluster):

{
  "InPlacePodVerticalScaling": {
    "supported": true,
    "kubernetesVersion": "1.35.0"
  },
  "PodCertificates": {
    "supported": true,
    "kubernetesVersion": "1.35.0"
  },
  "SidecarContainers": {
    "supported": false,
    "reason": "Kubelet version 1.34"
  }
}

The scheduler can then match Pods to capable nodes. The annotation below is illustrative; consult KEP-5328 for the exact mechanism in your version:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    scheduling.kubernetes.io/required-features: "InPlacePodVerticalScaling,PodCertificates"
spec:
  containers:
  - name: app
    image: my-app:1.0

The scheduler ensures this Pod lands on a node that supports both features. No manual labeling required.

Real-World Scenario: Rolling Cluster Upgrade

You’re upgrading a 100-node cluster from v1.34 to v1.35. With traditional approaches, you might:

Upgrade all nodes and accept the disruption
Create two separate node pools and migrate workloads
Hope nothing breaks during the gradual rollout

With Node Declared Features:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 10
  template:
    metadata:
      annotations:
        scheduling.kubernetes.io/required-features: "InPlacePodVerticalScaling"
    spec:
      containers:
      - name: app
        image: critical-app:2.0

This Deployment automatically schedules only on upgraded nodes, while your other workloads continue running on v1.34 nodes. The upgrade becomes safer and more manageable.

3. Pod Certificates: Native mTLS Identity

Microservices security often requires mutual TLS (mTLS) for service-to-service authentication. Until now, implementing this meant:

Installing SPIFFE/SPIRE (complex)
Using cert-manager with custom automation (brittle)
Integrating with service meshes like Istio (heavyweight)

The Native Kubernetes Solution

Pod Certificates (graduating to beta in v1.35) provides built-in certificate delivery for workloads through a podCertificate projected volume source. The signerName below (example.com/cluster-ca) is a placeholder for whichever signer implementation runs in your cluster:

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: frontend:1.0
    volumeMounts:
    - name: certs
      mountPath: /etc/certs
      readOnly: true
  volumes:
  - name: certs
    projected:
      sources:
      - serviceAccountToken:
          path: token
          expirationSeconds: 3600
      - podCertificate:
          signerName: example.com/cluster-ca  # placeholder signer
          keyType: ECDSAP256
          keyPath: tls.key
          certificateChainPath: tls.crt

The identity embedded in the certificate (for example its DNS names) is decided by the signer, not by fields in the Pod spec. The kubelet automatically:

Generates a private key and requests a certificate from the specified signer
Writes the certificate chain to /etc/certs/tls.crt
Writes the private key to /etc/certs/tls.key
Refreshes the certificate before it expires

Example: Securing Database Connections

Here’s how you’d configure a PostgreSQL client to use Pod Certificates:

apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api
    image: my-api:1.0
    env:
    - name: DB_HOST
      value: postgres.default.svc.cluster.local
    - name: DB_SSLMODE
      value: verify-full
    - name: DB_SSLCERT
      value: /etc/certs/tls.crt
    - name: DB_SSLKEY
      value: /etc/certs/tls.key
    - name: DB_SSLROOTCERT
      value: /etc/certs/ca.crt
    volumeMounts:
    - name: certs
      mountPath: /etc/certs
      readOnly: true
  volumes:
  - name: certs
    projected:
      sources:
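      - podCertificate:
          signerName: example.com/db-ca  # placeholder signer
          keyType: ECDSAP256
          keyPath: tls.key
          certificateChainPath: tls.crt
      # ca.crt is not written by podCertificate; supply the CA bundle
      # separately (for example from a ConfigMap or a clusterTrustBundle source)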

Your application code doesn’t need to know about certificate rotation or management — it just reads from the mounted path, which Kubernetes keeps up-to-date automatically.

4. Numeric Taints and Tolerations: Precision Scheduling

The taints and tolerations system is getting a significant upgrade with numeric comparison operators. This might sound like a small change, but it unlocks powerful new scheduling patterns.

The Old Way: Binary Decisions

Previously, you could only express “has this taint” or “doesn’t have this taint”:

# Taint a node
kubectl taint nodes gpu-node-1 gpu=nvidia:NoSchedule

# Tolerate it
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:
  - key: gpu
    operator: Equal
    value: nvidia
    effect: NoSchedule

This works for binary decisions, but what about expressing thresholds?

The New Way: Numeric Comparisons

With the new operators (alpha in v1.35), you can use Gt (greater than) and Lt (less than). Values are compared as integers:

# Taint nodes with their GPU memory
kubectl taint nodes gpu-node-1 gpu-memory=8:NoSchedule
kubectl taint nodes gpu-node-2 gpu-memory=16:NoSchedule
kubectl taint nodes gpu-node-3 gpu-memory=32:NoSchedule

# Tolerate only gpu-memory taints above 15, i.e. GPU nodes with 16GB or more
apiVersion: v1
kind: Pod
metadata:
  name: large-model-training
spec:
  tolerations:
  - key: gpu-memory
    operator: Gt
    value: "15"
    effect: NoSchedule
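
To make the comparison concrete, here is a minimal sketch that models it in shell. It assumes the semantics described above: the toleration matches when the taint's value is greater than (Gt) or less than (Lt) the toleration's value, and both values are integers. The function names are illustrative and are not part of kubectl or the scheduler.

```shell
# is_int: succeed only for plain integers such as 16 or -5
is_int() {
  case "$1" in
    ''|-|*[!0-9-]*|?*-*) return 1 ;;
    *) return 0 ;;
  esac
}

# tolerates: model the comparison a numeric toleration performs.
# Arguments: operator (Gt or Lt), toleration value, taint value.
# Returns 0 when the taint is tolerated, 1 when it is not, 2 on invalid input.
tolerates() {
  op="$1"; toleration="$2"; taint="$3"
  is_int "$toleration" && is_int "$taint" || return 2
  case "$op" in
    Gt) [ "$taint" -gt "$toleration" ] ;;
    Lt) [ "$taint" -lt "$toleration" ] ;;
    *)  return 2 ;;
  esac
}
```

With the taints above, `tolerates Gt 15 16` succeeds (gpu-node-2 is tolerated) while `tolerates Gt 15 8` fails (gpu-node-1 is not). Keep in mind that a toleration only permits scheduling onto a tainted node; it does not stop the Pod from landing on untainted nodes.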

Real-World Use Case: Network Bandwidth Requirements

Imagine you’re running a video streaming service. Some workloads need high bandwidth:

# Taint nodes with their network bandwidth (in Gbps)
kubectl taint nodes worker-1 network-bandwidth=1:NoSchedule
kubectl taint nodes worker-2 network-bandwidth=10:NoSchedule
kubectl taint nodes worker-3 network-bandwidth=25:NoSchedule

# 4K streaming needs at least 10Gbps
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-4k
spec:
  template:
    spec:
      tolerations:
      - key: network-bandwidth
        operator: Gt
        value: "9"
        effect: NoSchedule
      containers:
      - name: streamer
        image: video-streamer:4k

# Standard definition works on any node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-sd
spec:
  template:
    spec:
      tolerations:
      - key: network-bandwidth
        operator: Gt
        value: "0"
        effect: NoSchedule
      containers:
      - name: streamer
        image: video-streamer:sd

SLA-Based Scheduling

Another powerful use case is SLA-based scheduling:

# Nodes with different reliability SLAs, in hundredths of a percent
# (values must be integers, so 99.9% becomes 9990)
kubectl taint nodes worker-spot-1 availability-sla=9500:NoSchedule
kubectl taint nodes worker-ondemand-1 availability-sla=9990:NoSchedule
kubectl taint nodes worker-reserved-1 availability-sla=9999:NoSchedule

# Critical workload requires better than 99.5% uptime
apiVersion: v1
kind: Pod
metadata:
  name: payment-processor
spec:
  tolerations:
  - key: availability-sla
    operator: Gt
    value: "9950"
    effect: NoSchedule
  containers:
  - name: processor
    image: payment-processor:1.0

Critical Deprecations: What You Need to Do

1. Farewell to cgroup v1

Kubernetes v1.35 drops support for cgroup v1 on Linux nodes. This is a significant change that requires action.

Check Your Nodes

# Check if your nodes are using cgroup v2
# (kubectl debug mounts the node's root filesystem at /host)
for node in $(kubectl get nodes -o name); do
  echo "Checking $node"
  kubectl debug "$node" -it --image=ubuntu -- sh -c '
    if [ "$(stat -fc %T /host/sys/fs/cgroup/)" = "cgroup2fs" ]; then
      echo "Using cgroup v2 ✓"
    else
      echo "Using cgroup v1 ✗ - REQUIRES MIGRATION"
    fi
  '
done

Migration Path

If you find nodes using cgroup v1:

For Ubuntu/Debian:

# Enable cgroup v2 (this sed only matches an empty GRUB_CMDLINE_LINUX; edit the file by hand otherwise)
sudo sed -i 's/GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"/' /etc/default/grub
sudo update-grub
sudo reboot

For RHEL/CentOS:

# Enable cgroup v2
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
sudo reboot

Verify after reboot:

mount | grep cgroup2
# Should show: cgroup2 on /sys/fs/cgroup type cgroup2
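
Another way to verify, directly on a node, is to look at the filesystem type mounted at /sys/fs/cgroup: cgroup2fs indicates cgroup v2 and tmpfs indicates cgroup v1. A minimal sketch, with cgroup_version as an illustrative helper name:

```shell
# cgroup_version: map the filesystem type of /sys/fs/cgroup to a cgroup version.
# cgroup2fs is the unified (v2) hierarchy; tmpfs is the v1 layout.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown"; return 1 ;;
  esac
}
```

On the node itself this would be called as `cgroup_version "$(stat -fc %T /sys/fs/cgroup/)"`.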

2. Migrating from IPVS to nftables

If you’re using IPVS mode in kube-proxy, it’s time to migrate to nftables.

Check Current Mode

kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

Migration Steps

Update kube-proxy ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "nftables"  # Changed from "ipvs"
    # ... rest of config

Restart kube-proxy:

kubectl rollout restart daemonset kube-proxy -n kube-system

Verify the change:

kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using nftables"

Performance Comparison

In benchmarks, nftables shows:

30% better throughput for new connection establishment
50% lower memory usage for large services (>10,000 endpoints)
Better integration with modern Linux kernels

3. Containerd v2 Upgrade

Check your containerd version:

kubectl get nodes -o custom-columns=NAME:.metadata.name,CONTAINER-RUNTIME:.status.nodeInfo.containerRuntimeVersion

If you see containerd 1.x, upgrade to containerd 2.0 or later:

# For Ubuntu/Debian
sudo apt-get update
sudo apt-get install containerd.io=2.0.*

# For RHEL/CentOS
sudo yum update containerd.io-2.0.*

# Restart containerd
sudo systemctl restart containerd
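
On larger clusters it helps to filter the version listing down to the nodes that still need the upgrade. The helper below is a minimal sketch that reads the two-column output of the custom-columns command above and prints only nodes whose runtime is containerd with a major version below 2; the function name is illustrative.

```shell
# containerd_v1_nodes: read "NAME RUNTIME" lines on stdin and print the
# nodes still running containerd 1.x. Header and non-containerd lines are skipped.
containerd_v1_nodes() {
  awk '$2 ~ /^containerd:\/\// {
    version = $2
    sub(/^containerd:\/\//, "", version)   # strip the scheme prefix
    split(version, parts, ".")             # parts[1] is the major version
    if (parts[1] + 0 < 2) print $1, version
  }'
}
```

Usage: pipe the node listing into it, for example `kubectl get nodes -o custom-columns=NAME:.metadata.name,CONTAINER-RUNTIME:.status.nodeInfo.containerRuntimeVersion | containerd_v1_nodes`.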

Monitor with Prometheus

Add this alert to catch unsupported versions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: containerd-version-alert
spec:
  groups:
  - name: containerd
    rules:
    - alert: ContainerdVersionUnsupported
      expr: kubelet_cri_losing_support == 1
      for: 24h
      annotations:
        summary: "Node {{ $labels.node }} running unsupported containerd version"
        description: "Upgrade to containerd 2.0+ - v1.35 is the last version supporting containerd 1.x"

Preparing for the Upgrade

Pre-Upgrade Checklist

Audit Your Infrastructure

# Create a pre-upgrade report
kubectl get nodes -o json | jq '.items[] | {
  name: .metadata.name,
  kubelet: .status.nodeInfo.kubeletVersion,
  container_runtime: .status.nodeInfo.containerRuntimeVersion,
  os: .status.nodeInfo.osImage
}' > pre-upgrade-report.json
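
The kubelet versions collected in this audit tell you which nodes are still behind the target release. As a minimal sketch, the helper below reads "NAME VERSION" lines and prints the nodes whose kubelet minor version is below a given target; the function name nodes_below_minor is illustrative.

```shell
# nodes_below_minor: read "NAME vMAJOR.MINOR.PATCH" lines on stdin and print
# the nodes whose kubelet minor version is lower than the target minor.
nodes_below_minor() {
  awk -v target="$1" '$2 ~ /^v[0-9]/ {
    version = $2
    sub(/^v/, "", version)        # "v1.34.2" -> "1.34.2"
    split(version, parts, ".")    # parts[2] is the minor version
    if (parts[2] + 0 < target + 0) print $1, $2
  }'
}
```

Usage: `kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion | nodes_below_minor 35` lists every node that has not yet reached v1.35.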

Test in Staging

# Deploy v1.35 to a test cluster
kind create cluster --name k8s-135-test --image kindest/node:v1.35.0

# Run your test suite
kubectl apply -f test-workloads/

Backup everything

# Backup common workload resources (note: "get all" omits ConfigMaps, Secrets, CRDs and custom resources)
kubectl get all --all-namespaces -o yaml > backup-all-resources.yaml

# Backup etcd
ETCDCTL_API=3 etcdctl snapshot save snapshot.db

Rolling Upgrade Strategy

# Use PodDisruptionBudgets to control disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: critical-app

---
#!/bin/bash
# Drain nodes carefully
for node in $(kubectl get nodes -o name); do
  echo "Upgrading $node"

  # Cordon the node
  kubectl cordon "$node"

  # Drain with grace period
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=300

  # Upgrade the node (method depends on your setup)
  # ... node upgrade commands ...

  # Wait for node to be ready
  kubectl wait --for=condition=Ready "$node" --timeout=600s

  # Uncordon once the node reports Ready
  kubectl uncordon "$node"

  # Health check
  sleep 60
done
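
A percentage-based minAvailable interacts with the replica count in a way that is easy to overlook: Kubernetes rounds the percentage up, so small Deployments may allow no voluntary evictions at all and a drain will block. The arithmetic can be sketched as follows, assuming all replicas are healthy; the function name is illustrative.

```shell
# pdb_allowed_disruptions: how many Pods may be evicted at once under a
# percentage-based minAvailable. Arguments: replica count, percentage.
pdb_allowed_disruptions() {
  replicas="$1"; percent="$2"
  # ceil(replicas * percent / 100) using integer arithmetic
  min_available=$(( (replicas * percent + 99) / 100 ))
  echo $(( replicas - min_available ))
}
```

For the 80% budget above, `pdb_allowed_disruptions 10 80` prints 2, while `pdb_allowed_disruptions 3 80` prints 0, meaning a three-replica workload would stall the drain loop until the budget is relaxed.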

What This Means for Different Teams

For Platform Engineers

Immediate Actions:

Test in-place resource updates with your VPA setup
Evaluate Pod Certificates for replacing external cert management
Plan cgroup v2 migration for all nodes

Opportunities:

Reduce costs by 20–30% through better resource optimization
Simplify security architecture with native workload identity
Improve upgrade procedures with node declared features

For Security Teams

New Capabilities:

Native workload identity reduces external dependencies
Better audit trails with automatic certificate rotation logs
Improved compliance through standardized mTLS

Security Considerations:

Review certificate issuer configurations
Audit node feature declarations for security-sensitive workloads
Update security policies to leverage numeric taints for compliance zones

For SREs and Operators

Operational Improvements:

Vertical scaling without disruption reduces incident response time
Safer cluster upgrades with automatic feature detection
Better capacity planning with numeric taints

Monitoring:

# Add Prometheus rules for new features
# (metric names below are illustrative; confirm them against your kubelet's /metrics output)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-135-monitoring
spec:
  groups:
  - name: k8s-135-features
    rules:
    - record: kubelet:pod_resource_update_total
      expr: sum(rate(kubelet_pod_resource_update_total[5m])) by (node)

    - record: kubelet:declared_features
      expr: count(kubelet_declared_features) by (feature)

    - alert: NodeMissingRequiredFeature
      expr: kube_pod_status_scheduled{node!=""} 
        and on(node) kubelet_declared_features{feature="InPlacePodVerticalScaling"} == 0
      annotations:
        summary: "Pod scheduled on node without required features"

Conclusion

Kubernetes v1.35 represents a significant leap forward in production-readiness and operational excellence. The graduation of in-place Pod resource updates to GA alone is a game-changer for cost optimization and service reliability. Combined with native workload identity through Pod Certificates and smarter scheduling with Node Declared Features, this release provides the tools needed to run modern applications at scale with confidence.

The deprecations, while requiring some work, eliminate technical debt and point toward a more maintainable future. The removal of cgroup v1 support, IPVS mode deprecation, and containerd v1 sunset are all steps toward a cleaner, more efficient Kubernetes codebase.

Now that v1.35 is available, use the examples and migration guides in this article to plan your upgrade strategy. Test thoroughly in staging, monitor the new metrics, and take advantage of the 14-month support window to migrate at your own pace.

The Kubernetes community continues to demonstrate its commitment to both innovation and stability — a rare combination in the fast-moving cloud native ecosystem. Version 1.35 is proof that mature open source projects can deliver cutting-edge features while maintaining backward compatibility and providing clear upgrade paths.

Resources:

Official Kubernetes 1.35 Release Notes
KEP-1287: In-Place Update of Pod Resources
KEP-5328: Node Declared Features
KEP-4317: Pod Certificates
Migration Guides: kubernetes.io/docs/tasks/administer-cluster

Ready to upgrade? Join the conversation in the Kubernetes Slack #release-management channel and share your v1.35 experiences with the community.

Helm Best Practices 2025: What Changed with Helm 4 and What You Should Know

Rui Coelho — Tue, 18 Nov 2025 17:41:57 GMT

Helm 4.0 just dropped at KubeCon — here’s everything DevOps engineers need to know about the biggest changes in 6 years.

Six years after Helm 3, the Kubernetes package manager just got its biggest update. Helm 4.0 was released at KubeCon North America 2025 (November 10–13), bringing significant architectural changes, new features, and updated best practices that every DevOps engineer needs to understand.

If you’re managing Helm charts in production, this isn’t just another minor update — it’s a fundamental shift in how Helm handles deployments. But don’t panic: the team has focused heavily on maintaining chart compatibility while modernizing the underlying architecture.

In this guide, I’ll walk you through what’s new, what’s changed, the best practices you need to adopt, and how to migrate safely.

What’s New in Helm 4.0

The Big Changes

Helm 4.0 represents the first major version bump since 2019. Here are the headline features:

1. Server-Side Apply (SSA) is Now Default

The biggest architectural change: Helm 4 ditches the three-way merge strategy in favor of Server-Side Apply, the same approach Kubernetes itself uses.

What this means:

Better conflict detection and handling
Clearer ownership of fields
More predictable upgrade behavior
Explicit errors instead of silent overwrites

2. Completely Redesigned Plugin System

The plugin architecture got a complete overhaul with support for:

CLI plugins (command extensions)
Getter plugins (custom download protocols)
Post-renderer plugins (template modifications)
WebAssembly (WASM) runtime for cross-platform plugins

3. Advanced Resource Status Monitoring

Helm now uses kstatus for intelligent resource watching:

Waits for actual readiness, not just pod creation
Better understanding of resource conditions
Smarter timeout handling
Improved debugging when things fail

4. OCI Install by Digest

Enhanced OCI registry support with digest-based installs for:

Immutable chart references
Better supply chain security
Precise version control

5. Chart v3 Support

New chart API version (confirm its stability status in the Helm release notes before adopting it) with:

Backwards compatibility for v2 charts
Better dependency management
Enhanced metadata support

Breaking Changes You Need to Know

1. Plugin Migration Required

All existing plugins must be updated to work with Helm 4. The HIP-0026 plugin redesign means:

# Old plugin structure (Helm 3)
my-plugin/
  ├── plugin.yaml
  └── plugin.sh

# New plugin structure (Helm 4)
my-plugin/
  ├── plugin.yaml
  ├── main.wasm (or binary)
  └── metadata.json

Action required:

Audit your plugin usage: helm plugin list
Check with plugin maintainers for Helm 4 compatibility
Test plugins in staging before upgrading production

2. CLI Flag Renaming

Several CLI flags have been renamed for clarity:

# Helm 3
helm upgrade my-app ./chart --atomic --force

# Helm 4
helm upgrade my-app ./chart --rollback-on-failure --force-replace

Action required:

Update CI/CD scripts
Search codebase for hardcoded Helm commands
Update documentation
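
Searching the codebase can be scripted. The helper below is a minimal sketch that lists every line in a directory tree where a helm command uses a given flag; the function name find_helm_flag is illustrative. It only inspects single lines, so commands split across lines with backslashes need a manual look.

```shell
# find_helm_flag: list file:line:content for helm invocations that use a flag.
# Arguments: the flag to look for, optional directory (defaults to the current one).
find_helm_flag() {
  flag="$1"; dir="${2:-.}"
  # -r recurse, -n show line numbers, -E extended regex;
  # the trailing group keeps --wait from also matching --wait-for-jobs
  grep -rnE -- "helm[[:space:]].*${flag}([[:space:]=]|\$)" "$dir"
}
```

Call it once for each flag you need to review, for example `find_helm_flag --wait-for-jobs ci/` (the ci/ directory is a placeholder for wherever your pipeline scripts live).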

3. Package Restructuring (SDK Users)

If you’re using Helm as a Go library, packages have been reorganized:

// Helm 3
import "helm.sh/helm/v3/pkg/chart"

// Helm 4
import "helm.sh/helm/v4/pkg/chart/v2"

Action required:

Update import paths
Test integrations thoroughly
Review API changes in documentation

4. Server-Side Apply Conflicts

With SSA as default, conflicts are now explicit errors rather than silent overwrites:

# This will now error if another controller owns the field
helm upgrade my-app ./chart

# Use --force-conflicts to override (use carefully!)
helm upgrade my-app ./chart --force-conflicts

Important limitations:

❌ Multiple owners per manifest not supported
❌ Field ownership transfer not supported
✅ Backwards compatible with three-way merge charts (if K8s >= 1.22)

Helm 4 Best Practices: Updated for 2025

1. Embrace Server-Side Apply

Why it matters: SSA provides clearer semantics and better conflict handling.

Best Practice:

# In your chart templates, be explicit about ownership
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
  annotations:
    # Document who should own this resource
    # (informational only; Helm does not act on this annotation)
    meta.helm.sh/owner: "my-team"
data:
  config.yaml: |
    setting: value

What to avoid:

Don’t rely on undocumented merge behavior
SSA is more strict about field ownership

Migration tip: Test upgrades in staging with --dry-run=server first, so the API server evaluates the change and conflicts surface early.

2. Use Digest-Based OCI Installs

Why it matters: Ensures immutable deployments and better security.

Best Practice:

# Pin to specific digest, not tag
helm install my-app oci://registry.example.com/charts/my-app@sha256:abc123...

# Avoid mutable tags in production
# BAD: helm install my-app oci://registry.example.com/charts/my-app:latest

In your CI/CD:

# GitHub Actions example
- name: Install chart by digest
  run: |
    # helm pull prints a "Digest: sha256:..." line for OCI charts
    DIGEST=$(helm pull oci://registry/chart:${{ github.sha }} 2>&1 | awk '/^Digest:/ {print $2}')
    helm install app "oci://registry/chart@${DIGEST}"
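
A pipeline can also refuse to deploy anything that is not pinned by digest. The check below is a minimal sketch of such a guard; the function name is_digest_pinned is illustrative, and it only validates the shape of the reference (an oci:// URL ending in @sha256: followed by 64 hex characters), not that the digest exists in the registry.

```shell
# is_digest_pinned: succeed only for references of the form
# oci://<registry>/<path>@sha256:<64 lowercase hex characters>
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '^oci://[^@[:space:]]+@sha256:[0-9a-f]{64}$'
}
```

A deploy script could use it as `is_digest_pinned "$CHART_REF" || { echo "chart must be pinned by digest"; exit 1; }`, where CHART_REF is a placeholder for the reference your pipeline passes to helm install.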

3. Leverage Advanced Status Monitoring

Why it matters: Helm 4’s kstatus understands actual readiness.

Best Practice:

# Wait for real readiness, not just pod creation
helm install my-app ./chart \
  --wait \
  --timeout 10m

# Use specific status checks
helm install my-app ./chart \
  --wait \
  --wait-for-jobs \
  --rollback-on-failure  # Formerly --atomic

In your charts:

# Define readiness properly
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

4. Chart v3: Structure Your Charts Properly

Why it matters: Better organization and maintainability.

Best Practice:

my-chart/
├── Chart.yaml          # Use apiVersion: v3
├── values.yaml
├── values.schema.json  # JSON Schema validation
├── templates/
│   ├── _helpers.tpl    # Template functions
│   ├── deployment.yaml
│   ├── service.yaml
│   └── NOTES.txt       # User guidance
├── charts/             # Dependencies
└── crds/              # Custom Resource Definitions

Chart.yaml v3:

apiVersion: v3  # New in Helm 4
name: my-app
version: 1.0.0
dependencies:
  - name: postgresql
    version: "^12.0.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled

5. Use Multi-Document Values Files

Why it matters: Better organization of complex configurations.

Best Practice:

# values.yaml can now contain multiple documents
---
# Global configuration
global:
  domain: example.com
---
# Environment-specific overrides
env:
  production:
    replicas: 3
  staging:
    replicas: 1

Install with specific environment:

helm install my-app ./chart \
  --values values.yaml \
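  --set environment=production  # separate selector key (hypothetical); --set env=production would replace the whole env map with a string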

6. Implement Proper Secret Management

Why it matters: Security is non-negotiable.

Best Practice — Use External Secrets Operator:

# Don't put secrets in values.yaml
# Use External Secrets Operator instead
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: app-secrets
  data:
  - secretKey: db-password
    remoteRef:
      key: /secret/data/app
      property: db_password

Or use Sealed Secrets:

# Encrypt secrets before committing
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# Include sealed-secret.yaml in your chart

What to avoid:

# NEVER do this
apiVersion: v1
kind: Secret
data:
  password: cGFzc3dvcmQxMjM=  # Base64 is not encryption!

7. Validate Charts Before Deployment

Why it matters: Catch errors before they hit production.

Best Practice:

# 1. Lint the chart
helm lint ./my-chart

# 2. Template and validate
helm template my-app ./my-chart \
  --values values-prod.yaml \
  --validate

# 3. Dry-run install
helm install my-app ./my-chart \
  --values values-prod.yaml \
  --dry-run \
  --debug

# 4. Use external validators (kubeval is no longer maintained; kubeconform is its maintained successor)
helm plugin install https://github.com/instrumenta/helm-kubeval
helm kubeval ./my-chart

In your CI/CD:

# GitHub Actions example
- name: Validate Helm Chart
  run: |
    helm lint charts/*
    helm template test charts/my-app | kubeval --strict

8. Version Control Your Chart Dependencies

Why it matters: Reproducible builds.

Best Practice:

# Chart.yaml - Pin dependency versions
dependencies:
  - name: redis
    version: "17.11.3"  # Exact version, not range
    repository: "https://charts.bitnami.com/bitnami"

Lock dependencies:

# Generate Chart.lock
helm dependency update ./my-chart

# Commit Chart.lock to version control
git add Chart.lock
git commit -m "Lock chart dependencies"

What to avoid:

# Don't use version ranges in production
dependencies:
  - name: redis
    version: "^17.0.0"  # Any newer 17.x.y release could be pulled without notice
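A CI job can enforce this rule mechanically. The sketch below is a plain text scan, not a YAML parser, and assumes the simple dependency layout shown above; `is_exact_version` and `unpinned_dependencies` are hypothetical helper names:

```bash
# Hypothetical CI guard: succeed only if $1 is an exact semantic version
# (no ^, ~, >, x, or * range syntax).
is_exact_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$'
}

# Read Chart.yaml content on stdin and print every "version:" value that is
# not pinned. Returns non-zero if any unpinned value was found.
unpinned_dependencies() {
  found=0
  while IFS= read -r line; do
    case "$line" in
      *version:*)
        value=${line#*version:}                        # drop the key
        value=${value%%#*}                             # drop any trailing comment
        value=$(printf '%s' "$value" | tr -d "\"' ")   # drop quotes and spaces
        if ! is_exact_version "$value"; then
          printf '%s\n' "$value"
          found=1
        fi
        ;;
    esac
  done
  [ "$found" -eq 0 ]
}

# Usage:
#   unpinned_dependencies < my-chart/Chart.yaml || echo "pin the versions listed above"
```

Keys such as apiVersion and appVersion are skipped because the match is case-sensitive.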

9. Use Helm Test for Validation

Why it matters: Verify deployments actually work.

Best Practice:

# templates/tests/connection-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "my-app.fullname" . }}-test-connection"
  annotations:
    "helm.sh/hook": test
spec:
  containers:
  - name: wget
    image: busybox
    command: ['wget']
    args: ['{{ include "my-app.fullname" . }}:{{ .Values.service.port }}']
  restartPolicy: Never

Run tests:

# After installation
helm test my-app

# With verbose output
helm test my-app --logs

10. Document Your Charts Properly

Why it matters: Future you (and your team) will thank you.

Best Practice:

# Chart.yaml - Complete metadata
name: my-app
description: A production-ready application chart
type: application
version: 1.0.0
appVersion: "2.3.0"
keywords:
  - web
  - api
  - production
home: https://github.com/myorg/my-app
sources:
  - https://github.com/myorg/my-app
maintainers:
  - name: Your Name
    email: your.email@example.com

README.md template:

# My App Helm Chart

## Prerequisites
- Kubernetes 1.27+
- Helm 4.0+

## Installation
\`\`\`bash
helm install my-app ./my-app
\`\`\`

## Configuration
| Parameter | Description | Default |
|-----------|-------------|---------|
| `replicaCount` | Number of replicas | `1` |
| `image.repository` | Image repository | `myapp` |

## Examples
### Development
\`\`\`bash
helm install my-app ./my-app -f values-dev.yaml
\`\`\`
### Production
\`\`\`bash
helm install my-app ./my-app -f values-prod.yaml
\`\`\`

Migration Guide: Helm 3 to Helm 4

Step 1: Prepare Your Environment

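# Install Helm 4 (the script installs to /usr/local/bin/helm by default, so back up an existing Helm 3 binary first if you want to keep both)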
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-4 | bash

# Verify installation
helm version

# Check existing releases
helm list --all-namespaces

Step 2: Test Charts in Staging

# Template your charts with Helm 4
helm template my-app ./chart \
  --values values-staging.yaml \
  --debug

# Dry-run upgrade
helm upgrade my-app ./chart \
  --values values-staging.yaml \
  --dry-run \
  --debug

Step 3: Check for Conflicts

# Upgrade with conflict detection
helm upgrade my-app ./chart \
  --values values-staging.yaml

# If conflicts occur, inspect them
kubectl get <resource-type> <name> -o yaml --show-managed-fields

# Force if necessary (carefully!)
helm upgrade my-app ./chart \
  --force-conflicts

Step 4: Update Plugins

# List current plugins
helm plugin list

# Check for Helm 4 compatibility
# Visit plugin repos for updates
# Update plugins
helm plugin update <plugin-name>

Step 5: Update CI/CD Pipelines

# Before (Helm 3)
- name: Deploy with Helm
  run: |
    helm upgrade --install my-app ./chart \
      --atomic \
      --timeout 5m

# After (Helm 4): --atomic is now --rollback-on-failure
- name: Deploy with Helm
  run: |
    helm upgrade --install my-app ./chart \
      --rollback-on-failure \
      --timeout 5m
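To find pipeline steps that still use Helm 3 flag names, a grep over the CI definitions is usually enough. The sketch below assumes Helm 4 renames `--atomic` to `--rollback-on-failure` and `--force` to `--force-replace`; confirm this against `helm upgrade --help` for your version. `find_renamed_flags` is a hypothetical helper name:

```bash
# Hypothetical helper: print file:line:text for each use of a flag name that
# Helm 4 is assumed to have renamed (--atomic, --force).
# --force-conflicts and --force-replace are deliberately not matched.
# Exit status mirrors grep: 0 if at least one match was found.
find_renamed_flags() {
  grep -rnHE -e '--(atomic|force)([[:space:]=]|$)' "$@"
}

# Usage:
#   find_renamed_flags .github/workflows scripts/
```

The match is purely textual, so review the hits by hand: an unrelated command with its own `--force` flag (for example `kubectl delete --force`) will show up too.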

Step 6: Monitor the Migration

# Check release history
helm history my-app

# Verify resources
kubectl get all -l app=my-app
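# Check the field ownership recorded by Server-Side Apply
# (kubectl hides managedFields unless --show-managed-fields is set)
kubectl get deployment my-app -o yaml --show-managed-fields | grep -A 5 "managedFields"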
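Raw managedFields output gets long quickly. To see only which managers own fields on an object, the YAML can be filtered down to the manager names. A small sketch; `list_field_managers` is a hypothetical helper, and it assumes the usual kubectl YAML layout where each entry has a `manager:` key:

```bash
# Hypothetical helper: read `kubectl get ... -o yaml --show-managed-fields`
# output on stdin and print each distinct field manager once, sorted.
list_field_managers() {
  awk '
    $1 == "manager:"               { print $2 }
    $1 == "-" && $2 == "manager:"  { print $3 }
  ' | sort -u
}

# Usage:
#   kubectl get deployment my-app -o yaml --show-managed-fields | list_field_managers
```

If a name other than Helm's appears next to fields your chart sets, that manager is the likely source of conflict errors.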

Troubleshooting Common Issues

Issue 1: Conflict Errors After Upgrade

Symptom:

Error: UPGRADE FAILED: another controller owns this field

Solution:

# Option 1: Use --force-conflicts (understand implications first!)
helm upgrade my-app ./chart --force-conflicts

# Option 2: Identify and remove conflicting controller
kubectl get <resource-type> <name> -o yaml --show-managed-fields

# Option 3: Revert to client-side three-way merge temporarily
helm upgrade my-app ./chart --server-side=false

Issue 2: Plugin Not Working

Symptom:

Error: plugin "xyz" failed: exec format error

Solution:

# Check plugin compatibility
helm plugin list

# Update to Helm 4 compatible version
helm plugin update xyz

# Or install WASM version if available
helm plugin install https://github.com/author/plugin-wasm

Issue 3: Chart Templates Failing

Symptom:

Error: template: chart/templates/deployment.yaml: undefined variable

Solution:

# Validate with debug output
helm template my-app ./chart --debug

# Check for deprecated functions
# Some template functions may have changed
# Update to Chart API v3 if needed
# Edit Chart.yaml: apiVersion: v3

Issue 4: OCI Registry Authentication Fails

Symptom:

Error: failed to authorize: failed to fetch anonymous token

Solution:

# Login to registry
helm registry login registry.example.com \
  --username your-user

# Or point Helm at an existing registry credentials file
export HELM_REGISTRY_CONFIG=/path/to/config.json

# Verify
helm pull oci://registry.example.com/chart

Performance Improvements in Helm 4

Helm 4 isn’t just about features — it’s also faster:

Benchmarks (approximate):

Chart installation: ~15% faster
Template rendering: ~20% faster for complex charts
Dependency resolution: ~30% faster with content-based caching

What makes it faster:

Content-based caching for charts
Optimized dependency resolution
Parallel resource watching
Better memory management

What’s Coming Next

Helm 4’s release schedule:

Helm 4.0.0: November 2025 (KubeCon NA)
Helm 4.1.0: January 2026
Minor releases: Every 4 months

Helm 3 End of Life:

Helm 3 will reach EOL approximately 6–8 months after Helm 4 release
Estimated: July 2026
Action: Plan your migration accordingly

Conclusion

Helm 4.0 represents a significant evolution while maintaining the backwards compatibility that makes migration manageable. The shift to Server-Side Apply, redesigned plugin system, and enhanced status monitoring make Helm more robust and production-ready.

Key Takeaways:

✅ Server-Side Apply is the new default — Better conflict handling
✅ Plugin system redesigned — WASM support, better security
✅ Advanced status monitoring — True readiness detection
✅ OCI improvements — Digest-based installs for security
✅ Chart v3 support — Better dependency management

Action Items:

1. Immediate:

Test your charts with Helm 4 in staging
Audit plugin usage
Update CI/CD scripts for renamed flags

2. Short-term (1–2 months):

Migrate production deployments
Update chart documentation
Train team on new features

3. Long-term (3–6 months):

Adopt SSA best practices fully
Migrate to Chart v3
Update internal tooling and scripts

Getting Started:

# Install Helm 4
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-4 | bash

# Test your first chart
helm template my-app ./chart --debug

# Deploy when ready
helm upgrade --install my-app ./chart --wait

Helm 4 is production-ready and brings meaningful improvements. Start testing today, and plan your migration timeline. The Helm community has done an excellent job ensuring backwards compatibility while modernizing the tooling.

Have you upgraded to Helm 4 yet? What’s your experience been? Share in the comments!

Lightweight Kubernetes for DevOps Testing: A Practical Guide to Colima

Rui Coelho — Tue, 18 Nov 2025 17:39:15 GMT

Test your Kubernetes deployments locally without the overhead — a hands-on guide for DevOps engineers.

As a DevOps engineer, you know the drill: you need to test a Helm chart, validate some YAML manifests, or experiment with a new Kubernetes feature. But spinning up a cloud cluster for every test is expensive and slow, and your current local setup is… let’s say “temperamental.”

What if you could have a lightweight, disposable Kubernetes cluster on your laptop that starts in seconds, uses minimal resources, and doesn’t require licensing fees?

Enter Colima — an open-source tool that gives you Docker and Kubernetes locally with almost zero configuration. In this guide, I’ll show you exactly how to set it up and use it effectively for real DevOps testing scenarios.

The Problem with Local Kubernetes Development

Let’s be honest: testing Kubernetes deployments locally has always been challenging.

The Common Pain Points:

Setting up a local Kubernetes cluster takes time and effort
Docker Desktop requires licensing for commercial use in larger companies
Cloud-based testing gets expensive quickly
You need to test changes before pushing to staging or production
Different projects often need different cluster configurations

What DevOps Engineers Actually Need:

Quick environment spin-up for testing manifests
Ability to test Helm charts locally
Validate deployments before they hit the cluster
Test disaster recovery procedures
Experiment with new Kubernetes features safely
Multiple isolated environments for different projects

This is exactly what Colima solves — a lightweight, free way to run Kubernetes locally with minimal friction.

What is Colima?

Colima (short for Containers on Lima) is a container runtime that provides Docker and Kubernetes on macOS and Linux with minimal setup.

Under the hood, Colima uses:

Lima (Linux Machines) to create lightweight Linux VMs
QEMU or Apple’s Virtualization.Framework for virtualization
K3s as the Kubernetes distribution (when enabled)
Docker or Containerd as the container runtime

Why DevOps Teams Choose Colima:

🚀 Fast startup — get testing in seconds, not minutes
💾 Low resource footprint — won’t slow down your laptop
⚡ Native Apple Silicon support for M1/M2/M3 Macs
🆓 Completely free and open source (MIT licensed)
🔧 Simple CLI interface
🎯 Multiple profiles for different projects or test scenarios
🔄 Compatible with standard Kubernetes tools (kubectl, Helm, Skaffold)

Getting Started: Installing Colima

Installation

On macOS:

# Install Colima
brew install colima
# Verify installation
colima version

On Linux:

# Install from binary
curl -LO https://github.com/abiosoft/colima/releases/latest/download/colima-Linux-x86_64
chmod +x colima-Linux-x86_64
sudo mv colima-Linux-x86_64 /usr/local/bin/colima

Basic Setup

Start Colima with Docker:

# Start with default settings (2 CPUs, 2GB RAM, 100GB disk in current releases; older releases used 60GB)
colima start

# Verify it's running
colima status
# Test Docker
docker run hello-world

Start Colima with Kubernetes:

# Start with Kubernetes enabled
colima start --kubernetes

# Verify Kubernetes is running
kubectl cluster-info
kubectl get nodes

That’s it! You now have a working Kubernetes cluster on your laptop.

Real-World Configuration: Profiles for DevOps Testing

One of Colima’s best features for DevOps work is profiles — the ability to run multiple isolated Kubernetes environments for different testing scenarios.

Profile 1: Quick Manifest Testing

For validating YAML manifests and quick deployment tests:

colima start -p quick-test \
  --kubernetes \
  --cpu 2 \
  --memory 4 \
  --disk 50

# Switch kubectl context
kubectl config use-context colima-quick-test

Use case: Test a deployment manifest before committing to Git

Profile 2: Helm Chart Development

For developing and testing Helm charts:

colima start -p helm-dev \
  --kubernetes \
  --cpu 4 \
  --memory 8 \
  --disk 100 \
  --network-address

# The --network-address flag allows LoadBalancer services to work

Use case: Test Helm charts with all service types (including LoadBalancer)

Profile 3: Disaster Recovery Testing

For testing backup/restore procedures and failure scenarios:

colima start -p dr-test \
  --kubernetes \
  --cpu 4 \
  --memory 8 \
  --disk 150

Use case: Simulate node failures, test etcd backups, practice recovery procedures

Profile 4: CI/CD Pipeline Validation

For testing your CI/CD pipelines locally before running them in production:

colima start -p pipeline-test \
  --kubernetes \
  --cpu 6 \
  --memory 12 \
  --disk 100

Use case: Validate GitHub Actions, GitLab CI, or Jenkins pipelines that deploy to Kubernetes

Key flags explained:

--vm-type vz: Uses Apple's native Virtualization.Framework (macOS 13+) for better performance
--mount-type virtiofs: Better file system performance for volume mounts
--network-address: Enables LoadBalancer service support

Managing Profiles

# List all profiles
colima list

# Stop a specific profile
colima stop -p k8s

# Delete a profile
colima delete -p old-project

# Switch between profiles
docker context use colima-dev
docker context use colima-k8s
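Switching profiles means switching two contexts: one for Docker and one for kubectl. A small sketch of a wrapper that does both, assuming Colima's naming where the default profile registers the context `colima` and any other profile registers `colima-<profile>`; `colima_context` and `use_profile` are hypothetical helper names:

```bash
# Map a Colima profile name to the context name registered for it.
# Assumption: default profile -> "colima", any other profile -> "colima-<profile>".
colima_context() {
  case "${1:-default}" in
    default) echo "colima" ;;
    *)       echo "colima-$1" ;;
  esac
}

# Hypothetical wrapper: point both docker and kubectl at one profile.
# The kubectl context only exists if the profile was started with --kubernetes.
use_profile() {
  ctx=$(colima_context "$1")
  docker context use "$ctx" && kubectl config use-context "$ctx"
}

# Usage:
#   use_profile helm-dev
```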

Hands-On: Deploying Your First Application

Let’s deploy a real application to our local Kubernetes cluster.

Step 1: Create the Application

Create a simple nginx deployment:

# nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer

Step 2: Deploy to Kubernetes

# Start Colima with Kubernetes and LoadBalancer support
colima start --kubernetes --network-address

# Apply the deployment
kubectl apply -f nginx-deployment.yaml

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app=nginx --timeout=60s

# Check the deployment
kubectl get deployments
kubectl get pods
kubectl get services

Step 3: Access the Application

# Get the LoadBalancer IP
kubectl get svc nginx-service

# Access the application
curl http://<EXTERNAL-IP>

# Or use port-forward as an alternative
kubectl port-forward svc/nginx-service 8080:80

# Then access at http://localhost:8080
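In scripts, the external IP may not be assigned the moment the Service is created, so it helps to poll for it instead of copying it by hand. A minimal sketch; `wait_for_lb_ip` is a hypothetical helper, and the 2-second interval and 30-attempt default are arbitrary choices:

```bash
# Hypothetical helper: poll a Service until it has an external IP, then print it.
# $1 = service name, $2 = max attempts (default 30), 2 seconds apart.
wait_for_lb_ip() {
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    # An empty result means the load balancer has not been assigned yet.
    ip=$(kubectl get svc "$1" -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null) || ip=""
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  echo "no external IP assigned to $1" >&2
  return 1
}

# Usage:
#   IP=$(wait_for_lb_ip nginx-service) && curl "http://$IP"
```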

Step 4: Build and Deploy a Custom Image

# Build an image (it's automatically available in Kubernetes)
docker build -t my-app:v1 .

# Update your deployment to use the local image
kubectl set image deployment/my-app my-app=my-app:v1

# Verify the update
kubectl rollout status deployment/my-app

Pro Tip: With Docker runtime in Colima, images built with docker build are automatically available to Kubernetes—no need to push to a registry!

Advanced Configuration

Customizing Colima Settings

Colima uses a YAML configuration file. Edit it with:

colima start --edit

Example advanced configuration:

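# Colima configuration (key names differ between Colima releases; compare with the template that "colima start --edit" opens)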
cpu: 4
memory: 8
disk: 100

# VM settings
runtime: docker
kubernetes:
  enabled: true
  version: v1.32.0
  ingress: true  # Ingress controller toggle; only honored by Colima releases that support this key

# Network settings
network:
  address: true
  driver: slirp

# DNS settings
dns:
  - 8.8.8.8
  - 1.1.1.1

# Port forwarding
forward:
  - 8080:80
  - 5432:5432

# Volume mounts
mounts:
  - location: ~/projects
    writable: true
  - location: /tmp/colima
    writable: true

# Environment variables
env:
  DOCKER_BUILDKIT: "1"

Setting Up Ingress

Enable ingress for better service routing:

# Start with ingress enabled
colima start --kubernetes --kubernetes-ingress

# Or install manually (generic manifest; the provider/kind variant only schedules on nodes labeled ingress-ready=true)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

Example Ingress resource:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress
spec:
  rules:
  - host: nginx.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-service
            port:
              number: 80

Add to /etc/hosts:

echo "127.0.0.1 nginx.local" | sudo tee -a /etc/hosts

Persistent Storage

Colima now uses separate disks for container data, protecting against accidental data loss:

# Data persists even after deleting the instance
colima delete

# Restart and data is restored
colima start

# To delete everything including data
colima delete --data

Performance Optimization Tips

1. Use VZ on Modern Macs

On macOS 13+ (Ventura or later), use Apple’s native virtualization:

colima start --vm-type vz --mount-type virtiofs

Performance improvements:

~30% faster startup
Better CPU efficiency
Improved file system performance

2. Optimize for Your Workload

For CPU-intensive tasks:

colima start --cpu 8 --cpu-type max

For memory-intensive tasks:

colima start --memory 16

For disk-intensive tasks:

colima start --disk 200 --mount-type virtiofs

3. Resource Monitoring

# Check resource usage
colima status

# SSH into the VM to check resources
colima ssh
top
df -h
free -m
exit

4. Build Performance

Enable BuildKit for faster Docker builds:

export DOCKER_BUILDKIT=1

# Or add to your shell profile
echo 'export DOCKER_BUILDKIT=1' >> ~/.zshrc

Troubleshooting Common Issues

Issue 1: Kubernetes Not Starting

Symptoms: kubectl commands hang or fail

Solution:

# Stop and start fresh, with verbose output to see where startup stalls
colima stop
colima start --kubernetes --verbose

# Check the VM and runtime state
colima status

# Verify kubectl config
kubectl config view
kubectl cluster-info

Issue 2: LoadBalancer Services Stuck in Pending

Symptoms: EXTERNAL-IP shows <pending>

Solution:

# Restart with network address enabled
colima stop
colima start --kubernetes --network-address

# Verify networking
colima status

Issue 3: Docker Context Issues

Symptoms: docker commands fail with connection errors

Solution:

# List available contexts
docker context ls

# Switch to Colima context
docker context use colima

# Or select it for the current shell session only
export DOCKER_CONTEXT=colima

Issue 4: Volume Mount Performance

Symptoms: Slow file I/O in containers

Solution:

# Use virtiofs on modern Macs (requires the vz VM type)
colima stop
colima start --mount-type virtiofs

# Or use sshfs for better compatibility
colima start --mount-type sshfs

Issue 5: Port Conflicts

Symptoms: “Port already in use” errors

Solution:

# Check what's using the port
lsof -i :80

# Use a different profile
colima start -p myproject --edit
# Edit port forwards in the config file

Integration with Development Tools

VS Code Integration

Install the Docker extension and configure it to use Colima:

// settings.json
{
  "docker.dockerPath": "docker",
  "docker.dockerComposePath": "docker-compose",
  "kubernetes.kubectlPath": "/usr/local/bin/kubectl"
}

Helm Integration

Helm works seamlessly with Colima:

# Install Helm
brew install helm

# Add a repo
helm repo add bitnami https://charts.bitnami.com/bitnami

# Install a chart
helm install my-nginx bitnami/nginx

Skaffold for Rapid Development

# Install Skaffold
brew install skaffold

# Initialize in your project
skaffold init

# Start development mode
skaffold dev

Skaffold deploys to whichever kubectl context is active, so with a Colima context selected it rebuilds and redeploys on code changes.

Docker Compose

Docker Compose works out of the box:

# Install Docker Compose
brew install docker-compose

# Run your compose file
docker-compose up -d

Migration from Docker Desktop

Switching from Docker Desktop to Colima is straightforward:

Step 1: Export Your Data

# List your containers
docker ps -a

# Save important images
docker save my-app:latest -o my-app.tar

Step 2: Stop Docker Desktop

Quit Docker Desktop from the menu bar.

Step 3: Start Colima

colima start --cpu 4 --memory 8 --disk 100

Step 4: Import Your Data

# Load saved images
docker load -i my-app.tar

# Verify
docker images

Step 5: Update Your Workflow

# Replace Docker Desktop commands with:
colima start    # Instead of starting Docker Desktop
colima stop     # Instead of quitting Docker Desktop
colima status   # To check if it's running

Gotchas to Watch For:

Docker Desktop's Kubernetes and Colima's K3s differ slightly in behavior and defaults
Some Docker Desktop-specific features (like file watching) work differently
Volume paths may need adjustment

Best Practices for Daily Use

1. Create Project-Specific Profiles

# For each major project
colima start -p frontend --cpu 2 --memory 4
colima start -p backend --cpu 4 --memory 8 --kubernetes
colima start -p testing --kubernetes --arch x86_64

2. Automate Startup

Add to your shell profile (~/.zshrc or ~/.bashrc):

# Auto-start default profile if not running
if ! colima status &> /dev/null; then
  colima start
fi

3. Use Docker Contexts

# Switch between profiles easily
alias dk-dev='docker context use colima-dev'
alias dk-k8s='docker context use colima-k8s'

4. Resource Management

# Stop unused profiles to save resources
colima stop -p old-project

# Clean up regularly
docker system prune -a
colima stop && colima start  # Fresh start

5. Backup Important Data

# Export containers you care about
docker export my-container > backup.tar

# Save images
docker save my-app:v1 -o my-app-v1.tar

Real-World DevOps Testing Scenarios

Scenario 1: Testing Helm Chart Changes

You’ve modified a Helm chart and need to validate it before pushing to production:

# Start a dedicated profile for Helm testing
colima start -p helm-test \
  --kubernetes \
  --cpu 4 \
  --memory 8

# Test your chart
helm install my-app ./charts/my-app --dry-run --debug
helm install my-app ./charts/my-app
helm test my-app

# Validate the deployment
kubectl get all -l app=my-app
kubectl logs -l app=my-app

# Clean up for next test
helm uninstall my-app

Scenario 2: Validating Kubernetes Manifests

Before committing YAML changes, test them locally:

# Start a quick test environment
colima start -p manifest-test \
  --kubernetes \
  --cpu 2 \
  --memory 4

# Validate syntax
kubectl apply --dry-run=client -f k8s/

# Apply and test
kubectl apply -f k8s/
kubectl wait --for=condition=ready pod -l app=myapp --timeout=60s

# Check for issues
kubectl get events --sort-by='.lastTimestamp'
kubectl describe pods -l app=myapp

Scenario 3: Testing Ingress Configurations

Validate ingress rules and SSL configurations:

# Start with ingress enabled
colima start -p ingress-test \
  --kubernetes \
  --kubernetes-ingress \
  --cpu 4 \
  --memory 6

# Apply your ingress
kubectl apply -f ingress.yaml

# Test locally
curl -H "Host: myapp.local" http://localhost
curl -k -H "Host: myapp.local" https://localhost

Scenario 4: Disaster Recovery Drills

Practice your disaster recovery procedures:

# Create a test cluster
colima start -p dr-drill \
  --kubernetes \
  --cpu 4 \
  --memory 8

# Deploy your application
kubectl apply -f production-manifests/

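# Take a backup (using Velero or similar) and wait for it to complete
velero backup create test-backup --include-namespaces production --wait

# Simulate disaster - delete everything
kubectl delete namespace production

# Practice recovery and wait for the restore to finish
velero restore create --from-backup test-backup --wait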

# Validate recovery
kubectl get all -n production

Scenario 5: CI/CD Pipeline Testing

Test your deployment pipeline before running it in production:

# Create a pipeline test environment
colima start -p ci-test \
  --kubernetes \
  --cpu 6 \
  --memory 10

# Run your deployment script locally
./scripts/deploy.sh --env=staging --dry-run
./scripts/deploy.sh --env=staging

# Verify the deployment
./scripts/smoke-tests.sh

Scenario 6: Testing Resource Limits

Validate that your resource requests and limits are properly configured:

# Start a cluster
colima start -p resource-test --kubernetes

# Deploy with resource constraints
kubectl apply -f deployment-with-limits.yaml

# Monitor resource usage
kubectl top pods
kubectl top nodes

# Test under load
kubectl run -it --rm load-generator \
  --image=busybox \
  --restart=Never \
  -- /bin/sh -c "while true; do wget -q -O- http://my-service; done"
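The scenario applies `deployment-with-limits.yaml` without showing it. A minimal sketch of such a manifest, written from the shell; the name `limits-demo`, the image tag, and the request and limit values are placeholders chosen for illustration, and `write_limits_manifest` is a hypothetical helper:

```bash
# Hypothetical helper: write a small Deployment with requests and limits to $1.
write_limits_manifest() {
  cat > "$1" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: limits-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: limits-demo
  template:
    metadata:
      labels:
        app: limits-demo
    spec:
      containers:
      - name: app
        image: nginx:1.27
        resources:
          requests:      # what the scheduler reserves for the pod
            cpu: 100m
            memory: 64Mi
          limits:        # hard ceiling enforced at runtime
            cpu: 250m
            memory: 128Mi
EOF
}

# Usage:
#   write_limits_manifest deployment-with-limits.yaml
#   kubectl apply -f deployment-with-limits.yaml
```

Keeping the limits deliberately small makes throttling and out-of-memory kills easy to trigger with the load generator.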

Conclusion

Colima has become an essential tool for DevOps testing workflows, and here’s why it matters:

✅ Fast feedback loops: test changes in seconds, not minutes
✅ Cost-effective: no cloud costs for every test iteration
✅ Isolated environments: multiple profiles for different testing scenarios
✅ Production-like: real Kubernetes, not a simulation
✅ No licensing hassles: free for commercial use

Perfect For:

Testing Helm charts before deployment
Validating Kubernetes manifests locally
Disaster recovery drills
CI/CD pipeline development
Experimenting with new Kubernetes features
Training and knowledge transfer

Quick Start for DevOps Testing:

# Install
brew install colima docker kubectl

# Start with Kubernetes
colima start --kubernetes --network-address

# Test a deployment
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer

# Verify
kubectl get all

That’s it. You now have a disposable Kubernetes environment that you can use, break, and recreate whenever you need to test something. No cloud costs, no complex setup, just a simple tool that gets out of your way.

For DevOps engineers who need to test quickly and iterate fast, Colima is a game-changer.

Have you switched from Docker Desktop to Colima? What’s your experience been like? Share your thoughts in the comments!

ArgoCD 3.2: The Latest Stable Release Is Here

Rui Coelho — Tue, 18 Nov 2025 17:36:44 GMT

A deep dive into the newest features, improvements, and what you need to know before upgrading.

The GitOps community just received a significant update. ArgoCD 3.2.0 was released as a stable version on November 5th, 2025. If you’re running ArgoCD in production, this release deserves your immediate attention — especially if you’re still on version 2.14, which officially reached End of Life on November 4th, 2025.

In this article, we’ll explore what’s new, what’s changing, and how to prepare your team for the upgrade.

The Context: Why ArgoCD 3.2 Matters

Before diving into the features, let’s understand where ArgoCD 3.2 fits in the bigger picture.

The Evolution of ArgoCD 3.x

ArgoCD 3.0, released in early 2025, was a foundational release that introduced significant architectural improvements without being a risky upgrade. It refined RBAC controls, improved resource exclusions, and updated secrets management guidance.

Version 3.1, launched in August 2025, brought game-changing features like native OCI registry support, CLI plugins, and enhanced Source Hydrator functionality. These additions positioned ArgoCD as a more versatile GitOps tool for enterprise adoption.

The Critical Timeline

Here’s what you need to know: ArgoCD 2.14 reached End of Life (EOL) on November 4th, 2025. According to ArgoCD’s support policy, only the three most recent minor versions receive security updates and bug fixes. This means:

Currently supported: 3.2, 3.1, and 3.0
No longer supported: 2.14 and earlier versions
No more security patches or bug fixes for 2.14

If you’re still on 2.14 or earlier, you’re running an unsupported version. Your upgrade should be a top priority.
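The support rule is simple enough to encode in an audit script that runs across many clusters. A minimal sketch; `is_supported` is a hypothetical helper, and it assumes the three-minor window does not cross a major version, which holds for 3.0 through 3.2:

```bash
# Hypothetical check: is version $1 (e.g. "v3.1.4" or "2.14.2") inside the
# three-minor support window that ends at $2 (default "3.2")?
# Assumption: the window stays within one major version.
is_supported() {
  ver=${1#v}                      # accept both "v3.2.5" and "3.2.5"
  latest=${2:-3.2}
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  latest_major=${latest%%.*}
  latest_rest=${latest#*.}
  latest_minor=${latest_rest%%.*}
  [ "$major" -eq "$latest_major" ] &&
    [ "$minor" -le "$latest_minor" ] &&
    [ "$minor" -ge $((latest_minor - 2)) ]
}

# Usage:
#   is_supported v2.14.2 || echo "unsupported: plan an upgrade"
```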

What’s New in ArgoCD 3.2

Let’s break down the key features and improvements coming in this release.

1. Enhanced ApplicationSet Progressive Sync

Progressive Sync for ApplicationSets has received significant improvements in 3.2. This feature allows you to roll out changes gradually across multiple applications, which is crucial for risk management in production environments.

What’s improved:

Better UI visibility: The ApplicationSet UI now properly displays Progressive Sync status, resolving the “Unknown” state issue that plagued previous versions
Status cleanup: When Progressive Sync is disabled, the ApplicationSet now correctly clears the applicationStatus field, preventing stale data
Resource count tracking: A new status.resourcesCount field provides visibility into the number of resources managed by each ApplicationSet

This last point is particularly important. Large ApplicationSets could previously cause performance issues due to unbounded resource tracking. The new resource count limit helps prevent this:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-appset
spec:
  # ... your spec
status:
  resourcesCount: 150  # New field
  resources:
    - # Limited list of resources
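Once the field is populated, it can be read directly with kubectl to spot unusually large ApplicationSets. A sketch under stated assumptions: Argo CD runs in the `argocd` namespace, the field is exposed as `status.resourcesCount` as described above, and `appset_resource_count` and `appset_is_large` are hypothetical helper names:

```bash
# Hypothetical helper: print status.resourcesCount for an ApplicationSet.
# $1 = ApplicationSet name, $2 = namespace (default "argocd", an assumption).
appset_resource_count() {
  kubectl get applicationset "$1" -n "${2:-argocd}" \
    -o jsonpath='{.status.resourcesCount}'
}

# Succeed if the count is above the threshold in $2.
# A missing field is treated as 0; a kubectl failure returns 2.
appset_is_large() {
  count=$(appset_resource_count "$1") || return 2
  [ "${count:-0}" -gt "$2" ]
}

# Usage:
#   appset_is_large my-appset 500 && echo "my-appset manages more than 500 resources"
```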

2. Memory Optimization for Webhook Handlers

If you’re operating large monorepos or high-traffic environments, you’ll appreciate this: ArgoCD 3.2 optimizes webhook handlers to use informers instead of direct API calls. This change significantly reduces memory consumption during webhook processing.

Why it matters: In environments with frequent Git commits or large repositories, webhook processing could cause memory spikes. The new implementation is more efficient and stable.

Important note: Users with very large monorepos may still encounter repo-server lock contention requiring pod restarts. The ArgoCD team has acknowledged this issue, and a fix is planned for the next patch release (3.2.1).

3. Updated Health Checks

Health assessment is a critical part of ArgoCD’s functionality. Version 3.2 includes several health check updates:

Crossplane V2 support: Health checks now support Crossplane V2 resources, reflecting the evolution of the Crossplane ecosystem
External Secrets Operator: The ExternalSecret discovery script now includes the refreshPolicy field for more accurate health assessment
PromotionStrategy corrections: Fixed typos in the Promotion health checks that could cause false negatives

4. OCI Registry Improvements

Building on the OCI support introduced in 3.1, version 3.2 loosens layer restrictions, making it easier to use OCI registries for storing Kubernetes configurations. This is part of ArgoCD’s broader strategy to treat configuration artifacts with the same maturity as container images.

5. CLI and Notifications Fixes

Several quality-of-life improvements for CLI users:

The notifications CLI now properly initializes the argocdService, fixing initialization issues
Webhook payload handlers now gracefully recover from panics instead of crashing
Various documentation improvements and bug fixes

Breaking Changes and Migration Considerations

ArgoCD 3.2 maintains the low-risk upgrade philosophy of the 3.x series. However, there are some considerations:

Coming from ArgoCD 2.14

If you’re upgrading from 2.14, you’ll need to account for all the breaking changes introduced in 3.0 and 3.1:

Major changes from 3.0:

Fine-grained RBAC no longer applies to sub-resources by default
Health status persistence changes (now disabled by default)
Default resource exclusions for high-churn resources
Dex authentication claim changes (uses federated_claims.user_id instead of sub)

Major changes from 3.1:

OCI registry support enabled
CLI plugins architecture
Source Hydrator enhancements

Recommended Upgrade Path

Read the docs first: Review the official upgrade guide for your version
Test in non-production: Deploy 3.2 in a staging environment before touching production
Backup your state: Ensure you have backups of your ApplicationSet and Application resources
Plan for RBAC: If coming from 2.x, audit your RBAC policies
Monitor after upgrade: Watch for memory usage patterns and webhook processing

Real-World Impact: Who Benefits Most?

Platform Engineering Teams

If you’re building internal developer platforms, the ApplicationSet improvements will help you manage hundreds of applications more efficiently. The resource count tracking and memory optimizations mean you can scale further without hitting performance walls.

Large Monorepo Users

The webhook handler optimizations directly address pain points for teams managing large repositories. Less memory pressure means more stable ArgoCD instances during high-commit periods.

Multi-Tenant Environments

The continued refinement of RBAC and the stability improvements make ArgoCD 3.2 more suitable for multi-tenant setups where different teams share the same ArgoCD instance but need strict isolation.

Crossplane Users

If you’re adopting Crossplane for infrastructure management alongside ArgoCD for application deployment, the updated health checks for Crossplane V2 ensure better visibility into your control plane resources.

Release Timeline and What’s Next

ArgoCD 3.2.0 was released as a stable version on November 5th, 2025. Here’s what to expect moving forward:

Current stable version: 3.2.0
Supported versions: 3.2, 3.1, and 3.0
Next release: 3.3 is expected in approximately 3 months (following the quarterly release cadence)
Patch releases: Bug fixes and security updates will be released as 3.2.x versions as needed

The first patch release (3.2.1) is expected soon to address the large monorepo lock contention issue.

Hands-On: Installing ArgoCD 3.2

Ready to try the latest stable release? Here’s how to deploy ArgoCD 3.2.0 in your cluster:

Installation via Helm (Recommended)

Helm is the recommended way to install ArgoCD in production environments as it provides better configuration management and easier upgrades.

Add the ArgoCD Helm repository:

helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

Install ArgoCD 3.2.0:

# Create namespace
kubectl create namespace argocd

# Install with default configuration
helm install argocd argo/argo-cd \
  --namespace argocd \
  --version 9.1.0

# Or install with custom values
helm install argocd argo/argo-cd \
  --namespace argocd \
  --version 9.1.0 \
  --values values.yaml

Example values.yaml for production:

# Enable HA mode
redis-ha:
  enabled: true

controller:
  replicas: 2

server:
  replicas: 2
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - argocd.yourdomain.com
    tls:
      - secretName: argocd-tls
        hosts:
          - argocd.yourdomain.com

repoServer:
  replicas: 2

applicationSet:
  replicas: 2

# Enable notifications
notifications:
  enabled: true

# Server parameters (keep TLS enabled on argocd-server)
configs:
  params:
    server.insecure: false

Upgrade from previous version:

helm upgrade argocd argo/argo-cd \
  --namespace argocd \
  --version 9.1.0 \
  --values values.yaml

Installation via kubectl (Quick Start)

For testing or non-production environments, you can use kubectl directly:

Non-HA Installation:

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v3.2.0/manifests/install.yaml

High Availability Installation:

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v3.2.0/manifests/ha/install.yaml

Verifying the Installation

Check ArgoCD pods status:

kubectl get pods -n argocd

# Expected output (HA installation):
# NAME                                               READY   STATUS    RESTARTS   AGE
# argocd-application-controller-0                    1/1     Running   0          2m
# argocd-applicationset-controller-xxx               1/1     Running   0          2m
# argocd-dex-server-xxx                              1/1     Running   0          2m
# argocd-notifications-controller-xxx                1/1     Running   0          2m
# argocd-redis-ha-haproxy-xxx                        1/1     Running   0          2m
# argocd-redis-ha-server-0                           2/2     Running   0          2m
# argocd-repo-server-xxx                             1/1     Running   0          2m
# argocd-server-xxx                                  1/1     Running   0          2m

Verify the ArgoCD version:

kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-server -o jsonpath='{.items[0].spec.containers[0].image}'

# Should return: quay.io/argoproj/argocd:v3.2.0

Access the ArgoCD UI:

# Port-forward (for quick access)
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Then access: https://localhost:8080

Get the initial admin password:

# Same command for both Helm and kubectl installations
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d && echo

Login via CLI:

# Install ArgoCD CLI
brew install argocd  # macOS

# or
curl -sSL -o argocd https://github.com/argoproj/argo-cd/releases/download/v3.2.0/argocd-linux-amd64
chmod +x argocd
sudo mv argocd /usr/local/bin/

# Login
argocd login localhost:8080 --username admin --password <initial-admin-password> --insecure

# Verify version
argocd version

What to Test

Focus your testing on:

ApplicationSet Progressive Sync: Create an ApplicationSet with progressive sync enabled and verify the UI shows proper status
Memory usage: Monitor memory consumption during webhook processing
Health checks: If you use Crossplane, verify V2 resources show correct health status
RBAC: Validate your existing policies work as expected

Looking Ahead: The Future of ArgoCD

ArgoCD 3.2 continues the project’s trajectory toward becoming the definitive GitOps tool for Kubernetes. Some trends worth watching:

OCI-Native Configuration

The push toward OCI registries suggests a future where Kubernetes configurations are treated as first-class artifacts, with the same supply chain security guarantees as container images.

Progressive Delivery Integration

The improvements to Progressive Sync hint at deeper integration with progressive delivery patterns. We may see more sophisticated rollout strategies in future versions.

Platform Engineering Enablement

With better scalability and multi-tenancy support, ArgoCD is positioning itself as a critical component of internal developer platforms, not just a deployment tool.

Action Items: Upgrading to ArgoCD 3.2

Here’s your checklist:

Immediate (if on 2.14 or earlier — you’re on EOL!):

Upgrade immediately — you’re no longer receiving security patches
Review the 3.0, 3.1, and 3.2 release notes
Audit your RBAC policies for 3.0 compatibility
Set up a test environment with ArgoCD 3.2
Test your critical ApplicationSets and Applications

Before production upgrade:

Document your current ArgoCD configuration
Backup your ArgoCD state
Create a rollback plan
Schedule a maintenance window for production upgrade
Review the known issues (large monorepo lock contention)

Post-upgrade:

Monitor memory usage patterns
Validate all Applications sync correctly
Check webhook processing performance
Review ApplicationSet statuses
Watch for 3.2.1 patch release if you have large monorepos

Conclusion

ArgoCD 3.2 is now stable and production-ready. This release represents a measured evolution of the platform — not revolutionary, but a refinement that makes ArgoCD more stable, performant, and user-friendly. The ApplicationSet improvements, memory optimizations, and updated health checks address real pain points that operators face in production.

If you’re on ArgoCD 2.14, you’re running an unsupported version. Upgrade immediately. If you’re already on 3.0 or 3.1, upgrading to 3.2 should be low-risk, but still test thoroughly, especially if you operate large monorepos.

The GitOps ecosystem continues to mature, and ArgoCD remains at the forefront. Version 3.2 is another solid step in that journey, and with the EOL of 2.14, there’s no better time to upgrade than now.

Have you upgraded to ArgoCD 3.2 yet? What’s your experience been like? Share your thoughts in the comments below!

From 0 to Hero: Mastering Auto Scaling in Kubernetes

Rui Coelho — Tue, 18 Nov 2025 17:25:39 GMT

Scaling applications is one of the hardest challenges in cloud-native environments. With Kubernetes, you get powerful autoscaling primitives that make workloads adaptive, resilient, and cost-efficient.

In this guide, we’ll go from zero to hero in Kubernetes autoscaling, covering:

Horizontal Pod Autoscaler (HPA)
Vertical Pod Autoscaler (VPA)
Cluster Autoscaler (CA)
Cluster Proportional Autoscaler (CPA)
KEDA (Kubernetes Event-Driven Autoscaling)

We will explore how each works, which components are involved, what metrics they use, best practices, and what the future holds.

Why Auto Scaling Matters

Without autoscaling, you either:

Overprovision resources → wasting money
Underprovision resources → degraded performance or downtime

Kubernetes addresses this with autoscaling mechanisms at different levels:

Pod Level → HPA & VPA
Node Level → Cluster Autoscaler (CA)
Cluster Add-on Level → CPA
Event-Driven Level → KEDA

Horizontal Pod Autoscaler (HPA)

The HPA adjusts the number of pod replicas in a Deployment, StatefulSet, or ReplicaSet.

How it works

Periodically checks metrics (default every 15s).
Compares observed values to target thresholds.
Adjusts the .spec.replicas field of the workload.
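
To make the loop above concrete, here is a minimal sketch of the replica calculation the HPA documentation describes: desired replicas are the current replicas scaled by the ratio of observed to target metric, rounded up. The function name is hypothetical, and the 10% tolerance is the documented controller default, which your cluster may override.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica count from the documented HPA ratio formula (simplified sketch)."""
    ratio = current_value / target_value
    # Inside the tolerance band the HPA leaves the workload alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 pods averaging 90% CPU against a 70% target
print(desired_replicas(4, 90, 70))
```

This ignores pods that are not ready, missing metrics, and minReplicas/maxReplicas clamping, all of which the real controller also takes into account.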

Components in Detail

HPA Controller: Runs inside the kube-controller-manager; makes scaling decisions.
Metrics Server: Collects CPU/Memory usage from kubelets and exposes resource metrics.
Custom Metrics Adapter: Connects with systems like Prometheus to expose application-level metrics.
External Metrics Adapter: Integrates with external services (CloudWatch, Pub/Sub, SQS) for business-level signals.

Types of Metrics Supported

1. Resource Metrics (Native)

These come from the Metrics Server. They cover CPU and memory utilization.

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

2. Custom Metrics

Custom metrics allow scaling based on application-level signals when CPU/Memory are not good indicators of load.

Sourced from inside the cluster, typically via Prometheus and an adapter.
Examples include:
HTTP requests per second
Average response latency
Active sessions or connections

metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

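This example targets an average of 100 requests per second per pod: the HPA adds replicas when the per-pod average rises above that value and removes them when it falls below.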

3. External Metrics

External metrics allow scaling based on signals outside the cluster.

Examples include:

Messages in an AWS SQS queue
Pending tasks in GCP Pub/Sub
Business KPIs such as number of orders waiting

metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready
      target:
        type: AverageValue
        averageValue: "50"

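Because the target type is AverageValue, the metric is divided by the current number of replicas. This example therefore targets about 50 pending messages per replica: a backlog of 500 messages asks for roughly 10 replicas.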

Custom Metrics vs External Metrics

Type	Source	Examples	Adapter Required
Custom	Inside the cluster (Prometheus, application internals)	Requests/sec, latency, active sessions	Yes
External	Outside the cluster (cloud services, APIs)	Queue length, Pub/Sub backlog, business KPIs	Yes

Rule of thumb: use Custom metrics when the signal is inside the cluster; use External metrics when the signal comes from outside.

Pros

Native, widely supported.
Works with CPU, memory, custom, and external metrics.

Limitations

Does not resize pod resources, only replicas.
Requires adapters for advanced metrics.

Vertical Pod Autoscaler (VPA)

The VPA automatically adjusts CPU and memory requests/limits for pods.

How it works

Continuously observes resource usage.
Provides recommendations or enforces new requests/limits.
Applies changes according to its operating mode.

Components in Detail

Recommender: Analyzes metrics and suggests optimal CPU/memory.
Updater: Decides when to evict pods to apply new values.
Admission Controller (Plugin): Mutates pod specs on creation with recommended resources.
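
As a rough intuition for what the Recommender does, the sketch below derives a CPU request from a high percentile of observed usage plus a safety margin. It is a simplification under stated assumptions: the real Recommender works on decaying histograms rather than raw sample lists, and the 0.9 percentile, the 15% margin, and the function names are illustrative choices, not values read from your cluster.

```python
import math


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def recommend_request(samples: list[float], q: float = 0.9,
                      margin: float = 0.15) -> float:
    """High-percentile usage plus a safety margin (assumed values)."""
    return percentile(samples, q) * (1 + margin)


# Observed CPU usage in millicores, including one outlier spike
cpu_millicores = [120, 150, 180, 200, 220, 250, 300, 310, 400, 900]
print(round(recommend_request(cpu_millicores)))
```

Using a percentile instead of the maximum keeps a single spike (900m here) from inflating the request for every pod.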

VPA Modes

Off

Provides recommendations only.
Useful for testing and observability.

2. Initial

Applies recommended resources only at pod creation.
Pods keep the same resources until deleted or recreated.

3. Auto

Continuously adjusts resources.
May evict and restart pods to apply new requests/limits.

Example

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: recommendation-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: recommendation
  updatePolicy:
    updateMode: "Auto"   # Options: Off, Initial, Auto

Pros

Prevents under/over-provisioning.
Adapts pods to real usage.

Limitations

Pod restarts required to apply changes.
Should not be combined with HPA on the same workload when both act on CPU or memory, because the two controllers will fight each other.

Cluster Autoscaler (CA)

The Cluster Autoscaler manages the number of nodes in your cluster.

How it works

Scales up when pods cannot be scheduled.
Scales down when nodes are underutilized.
Relies on integration with the underlying cloud provider.

Components in Detail

CA Controller: Observes scheduling failures and underutilized nodes.
Cloud Provider Integration: Uses APIs to add/remove nodes (AWS ASGs, GCP MIGs, Azure VMSS).
Scale-down Logic: Removes nodes only if it will not disrupt critical workloads.
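
The scale-down side can be sketched as a utilization check: a node becomes a removal candidate when its requested resources fall below a threshold. This is a minimal sketch that assumes the commonly documented default threshold of 0.5 and hypothetical node data. The real Cluster Autoscaler also verifies that every pod on the node can be rescheduled elsewhere and respects PodDisruptionBudgets.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_allocatable_m: int    # millicores
    mem_allocatable_mi: int   # MiB
    cpu_requested_m: int
    mem_requested_mi: int


def utilization(node: Node) -> float:
    """Higher of the CPU and memory request ratios."""
    return max(node.cpu_requested_m / node.cpu_allocatable_m,
               node.mem_requested_mi / node.mem_allocatable_mi)


def scale_down_candidates(nodes: list[Node], threshold: float = 0.5) -> list[str]:
    """Names of nodes whose requested share is below the threshold."""
    return [n.name for n in nodes if utilization(n) < threshold]


nodes = [
    Node("node-a", 4000, 16384, 3200, 8192),
    Node("node-b", 4000, 16384, 800, 4096),
    Node("node-c", 4000, 16384, 1000, 9000),
]
print(scale_down_candidates(nodes))
```

Note that the check uses pod requests, not live usage, which is one reason accurate requests matter for node-level cost savings.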

Pros

Ensures the cluster has enough capacity.
Saves costs by scaling down idle nodes.

Limitations

Cloud-provider specific.
Conservative when scaling down.

Cluster Proportional Autoscaler (CPA)

The CPA is specialized for scaling cluster add-ons such as CoreDNS. Unlike the Cluster Autoscaler, which changes the number of nodes, the CPA adjusts the number of replicas of add-on workloads so they grow proportionally with the cluster.

How it works

Monitors cluster size (nodes or CPU cores).
Adjusts replicas of add-on components proportionally.
Typically used for Deployments such as CoreDNS.

Components in Detail

CPA Controller: Runs as a deployment in the kube-system namespace.
Scaling Config: Defines proportional rules (linear or ladder).

Example Config (conceptual)

linear:
  coresPerReplica: 256
  nodesPerReplica: 16
  min: 1
  max: 10

This means: 1 replica per 256 CPU cores or per 16 nodes, whichever yields more replicas, clamped to the range 1–10.
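
The linear mode can be worked through in a few lines. This sketch follows the formula published in the cluster-proportional-autoscaler README (take the larger of the core-based and node-based counts, then clamp); the function name and the sample cluster sizes are illustrative.

```python
import math


def cpa_linear_replicas(cores: int, nodes: int, cores_per_replica: int = 256,
                        nodes_per_replica: int = 16,
                        min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replica count for CPA linear mode with the config shown above."""
    by_cores = math.ceil(cores / cores_per_replica)
    by_nodes = math.ceil(nodes / nodes_per_replica)
    replicas = max(by_cores, by_nodes)
    return max(min_replicas, min(replicas, max_replicas))


# 40 nodes with 8 cores each: cores suggest 2 replicas, nodes suggest 3
print(cpa_linear_replicas(cores=320, nodes=40))
```

With many small nodes the node-based term dominates; with a few very large nodes the core-based term does.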

Why it is not always used

In many environments, CPA is not strictly required because DaemonSets already provide proportional coverage. A DaemonSet ensures that each node runs exactly one pod (for example, kube-proxy or a logging agent). This one-pod-per-node model automatically scales with the cluster as nodes are added or removed.

As a result:

CPA is most useful for Deployments where proportional scaling is needed.
For simpler infrastructure components, a DaemonSet is often enough and eliminates the need for CPA.

Pros

Keeps critical add-ons responsive as the cluster grows.
Provides proportional scaling for Deployments.

Limitations

Not needed when DaemonSets are sufficient (common in simpler setups).
Focused on infrastructure-level services, not application workloads.

KEDA (Kubernetes Event-Driven Autoscaling)

KEDA extends Kubernetes with event-driven scaling capabilities. While the HPA traditionally reacts to resource usage or custom/external metrics, KEDA lets workloads scale based on event sources such as message queues, databases, or cloud services.

How it works

You define a ScaledObject (for Deployments/StatefulSets) or a ScaledJob (for batch workloads).
KEDA deploys an Operator and a Metrics Adapter into the cluster.
Scalers fetch metrics from external systems and expose them through the Kubernetes metrics API.
The HPA (managed by KEDA) consumes those metrics to scale workloads.
KEDA can scale workloads down to zero when no events are present, something the native HPA cannot do by default.

Components in Detail

KEDA Operator: Watches CRDs like ScaledObject and ScaledJob, and creates an HPA automatically for the target workload.
Metrics Adapter: Exposes metrics from scalers to the HPA using the Kubernetes external metrics API.
Scalers: Plugins that connect to external systems. KEDA supports more than 50 scalers including Kafka, RabbitMQ, AWS SQS, Azure Service Bus, GCP Pub/Sub, Prometheus, MySQL, PostgreSQL, and more.
ScaledObject: Defines autoscaling for Deployments or StatefulSets.
ScaledJob: Defines autoscaling for Jobs — each event can spawn a new Job until the backlog is cleared.

Example: RabbitMQ

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-worker
spec:
  scaleTargetRef:
    name: orders-deployment
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        queueName: orders
        hostFromEnv: RABBITMQ_HOST
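        mode: QueueLength   # scale on queue backlog (replaces the deprecated queueLength field)
        value: "5"          # target messages per replica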

This configuration will:

Scale the orders-deployment between 0 and 20 replicas.
Target about 5 messages per replica in the RabbitMQ orders queue, so a backlog of 50 messages asks for roughly 10 replicas.
Scale back to zero once the queue is empty and the cooldown period has passed.
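
A minimal sketch of how those three rules combine for the orders queue. It assumes an activation threshold of 0 and ignores the cooldown period and the HPA tolerance, so treat it as intuition rather than KEDA's actual code path; the function name is hypothetical.

```python
import math


def keda_replicas(queue_length: int, target_per_replica: int = 5,
                  max_replicas: int = 20) -> int:
    """Approximate replica count for a queue-length trigger."""
    # KEDA itself handles the 0 <-> 1 transition; an empty queue means zero replicas.
    if queue_length <= 0:
        return 0
    # From 1 replica upward the generated HPA targets an average backlog per replica.
    wanted = math.ceil(queue_length / target_per_replica)
    return max(1, min(wanted, max_replicas))


for backlog in (0, 3, 12, 1000):
    print(backlog, keda_replicas(backlog))
```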

Example: Batch Jobs with ScaledJob

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor
spec:
  jobTargetRef:
    parallelism: 1
    completions: 1
    template:
      spec:
        containers:
        - name: worker
          image: my-image-processor:latest
        restartPolicy: Never
  pollingInterval: 30
  maxReplicaCount: 50
  triggers:
    - type: azure-queue
      metadata:
        queueName: images
        connectionFromEnv: AzureWebJobsStorage   # env var on the job container holding the connection string

This configuration spawns new Jobs for image processing tasks whenever new messages arrive in the Azure Queue.

Use Cases

Event-driven microservices (workers consuming from queues).
Serverless-style workloads that run only on demand.
Batch processing pipelines (images, logs, data).
Scaling based on cloud services (databases, queues, monitoring systems).

Pros

Event-driven: scale workloads based on real demand.
Scale-to-zero: cost-efficient for workloads with idle periods.
Wide ecosystem of scalers (cloud-native and traditional systems).
Works alongside HPA and integrates natively into Kubernetes.

Limitations

Adds operational complexity (extra components in the cluster).
Requires careful configuration of triggers to avoid over-scaling or flapping.
Each scaler has its own configuration specifics.

Cost Optimization and Trade-offs

Autoscaling is not just about performance; it directly impacts cost. Configuring it poorly can lead to unnecessary expenses or underutilization.

Unbounded scaling can increase costs dramatically: Always set a sensible maxReplicas to prevent runaway scaling in case of metric spikes.
Scale-to-zero saves money: KEDA’s ability to scale to zero during idle periods can significantly reduce cloud bills for workloads that are not always active.
Right-sizing matters: Use VPA recommendations (for example in Off mode) to set realistic requests before HPA multiplies the pods, so oversized pods are not replicated unnecessarily.
Cluster Autoscaler trade-offs: While CA saves money by removing nodes, frequent scale-up and scale-down events may increase cloud costs (e.g., by breaking node-level discounts).

Cooldowns, Stabilization Windows, and Advanced HPA Features

By default, HPA reacts quickly to metric changes, but in production environments, rapid scaling up and down can cause instability. Kubernetes offers advanced options to control scaling behavior:

Stabilization Windows

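A stabilization window makes the HPA look back over recent recommendations instead of acting only on the latest one. For scale-down, it uses the highest recommendation seen within the window.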

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300

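With this setting, replicas are reduced only after every recommendation from the last 5 minutes has been lower than the current count. 300 seconds is also the default for scale-down.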
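
The scale-down window can be sketched as a small piece of bookkeeping: remember recent recommendations and never drop below the highest one still inside the window. The class below is a simplified model with hypothetical names. It assumes no scale-up stabilization window and ignores scaling policies.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keeps recent recommendations and applies scale-down stabilization."""

    def __init__(self, window_seconds: float = 300):
        self.window = window_seconds
        self.history = deque()  # (timestamp, recommended replicas)

    def stabilize(self, now: float, desired: int, current: int) -> int:
        self.history.append((now, desired))
        # Forget recommendations that have aged out of the window.
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        highest = max(replicas for _, replicas in self.history)
        # Scale up immediately; scale down only as far as the window allows.
        return min(max(current, desired), highest)


s = ScaleDownStabilizer(300)
for t, desired in [(0, 10), (60, 4), (240, 4), (301, 4)]:
    print(t, s.stabilize(t, desired, current=10))
```

In this run the load drops at t=60, but the replica count stays at 10 until the recommendation of 10 from t=0 has left the window.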

Scaling Policies

Scaling policies let you limit how aggressively the HPA scales.

behavior:
  scaleUp:
    policies:
    - type: Pods
      value: 2
      periodSeconds: 60
    - type: Percent
      value: 100
      periodSeconds: 60

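This configuration allows scaling up by 2 pods per minute or by 100% of the current replica count per minute. When several policies are listed, the default selectPolicy is Max, so the policy that permits the larger change wins; set selectPolicy: Min if you want the more conservative one to apply.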
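
To see how the two policies interact, the sketch below computes the upper bound on replicas for one period. It assumes the default selectPolicy of Max and ignores replicas already added earlier in the same period; the function name is hypothetical.

```python
import math


def scale_up_limit(current: int, policies: list[dict],
                   select_policy: str = "Max") -> int:
    """Highest replica count the policies allow within one period."""
    limits = []
    for policy in policies:
        if policy["type"] == "Pods":
            limits.append(current + policy["value"])
        elif policy["type"] == "Percent":
            limits.append(math.ceil(current * (1 + policy["value"] / 100)))
    return max(limits) if select_policy == "Max" else min(limits)


policies = [
    {"type": "Pods", "value": 2, "periodSeconds": 60},
    {"type": "Percent", "value": 100, "periodSeconds": 60},
]
print(scale_up_limit(3, policies))         # Percent wins: 3 -> 6
print(scale_up_limit(1, policies))         # Pods wins: 1 -> 3
print(scale_up_limit(3, policies, "Min"))  # conservative choice: 3 -> 5
```

The Pods policy helps small workloads get off the ground, while the Percent policy lets large ones grow quickly.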

Why it matters

Prevents “thrashing” where replicas fluctuate constantly.
Helps maintain system stability under unpredictable load.
Provides more predictable cost and resource usage.

Observability and Monitoring

Autoscaling decisions are only as good as the signals they are based on. Without observability, you cannot validate whether scaling actions improve performance or simply add cost.

What to Monitor

Autoscaler status: Replica counts over time (HPA, VPA, KEDA, CA).
Scaling decisions: Why did the autoscaler decide to add/remove replicas?
Business KPIs: Requests per second, queue lengths, user sessions — to ensure scaling aligns with actual demand.
Cost impact: Correlate scaling events with cloud spend.

Tools and Integrations

Prometheus + Grafana: The most common stack to visualize metrics and scaling decisions.
kube-state-metrics: Exposes the state of HPA objects, along with the workloads and nodes they affect, as Prometheus metrics.
Datadog, New Relic, Dynatrace: SaaS observability platforms with built-in Kubernetes autoscaling dashboards.
Cloud provider monitoring: AWS CloudWatch, GCP Monitoring, Azure Monitor provide integration with CA and KEDA.

Why it matters

Ensures that scaling actions align with application performance, not just resource usage.
Detects misconfigurations early (e.g., HPA scaling on wrong metric).
Provides insights to fine-tune thresholds and stabilization windows.

Comparing the Autoscalers

Autoscaler	Components	Scope	Metrics	Best For	Scale to 0?
HPA	Controller Manager, Metrics Server, Adapters	Pod replicas	Resource, Custom, External	Stateless apps	No
VPA	Recommender, Updater, Admission Controller	Pod resources	Historical usage	Stateful/ML workloads	No
CA	CA Controller, Cloud APIs	Nodes	Cluster utilization	Node-level elasticity	No
CPA	CPA Controller, Scaling Config	Add-ons	Cluster size	CoreDNS and similar add-on Deployments	No
KEDA	Operator, Metrics Adapter, Scalers	Pods & Jobs	Event-driven signals	Workers, serverless jobs, batch pipelines	Yes

Best Practices for Autoscaling

Always set minReplicas and maxReplicas to avoid runaway scaling.
Avoid using HPA and VPA on the same deployment (at least with the same metrics).
Ensure metrics reflect business reality, not just CPU.
Run load tests to fine-tune thresholds.
Configure cooldown periods to prevent thrashing.
For KEDA, adjust polling intervals carefully.
Monitor both system-level and business-level signals to validate scaling behavior.

The Future of Autoscaling in Kubernetes

Autoscaling is evolving rapidly:

Growing adoption of event-driven scaling with KEDA.
Research into predictive autoscaling using AI/ML.
Work on autoscaler composition (HPA + VPA + KEDA together).
A shift toward policy-driven, autonomous scaling clusters.

The future points toward Kubernetes clusters that self-optimize without human intervention.

From Zero to Hero: A Real Example

For an e-commerce platform:

Frontend API → HPA scaling based on CPU + requests/sec.
Recommendation engine → VPA right-sizes pods for ML models (Auto mode).
Cluster Autoscaler (CA) → Adds nodes when HPA demands exceed current capacity.
CoreDNS → CPA scales proportionally with cluster size, unless a simple DaemonSet is enough.
Order workers → KEDA scales with RabbitMQ queue length (down to zero).

This combination ensures performance, stability, and cost efficiency.

Conclusion

Kubernetes autoscaling is not a single feature — it is an ecosystem:

HPA manages pod replicas with multiple metric sources.
VPA right-sizes pods dynamically, with modes for every scenario.
CA ensures enough nodes exist.
CPA keeps cluster add-ons healthy, though often replaced by DaemonSets in simpler cases.
KEDA powers event-driven and serverless scaling, with support for batch jobs and scale-to-zero.
Cost optimization, advanced HPA features, and observability ensure autoscaling is efficient, stable, and financially sustainable.

By mastering these components, you can truly go from zero to hero in Kubernetes autoscaling.

How to Use Kubernetes Dynamic Resource Allocation (DRA) — Real Use Case with GPUs

Rui Coelho — Tue, 18 Nov 2025 17:15:05 GMT

Introduction

With Kubernetes 1.34, Dynamic Resource Allocation (DRA) is now generally available (GA) — enabling pods to request and allocate specialized hardware like GPUs, FPGAs, and high-speed storage dynamically.

This article covers:

What DRA is and how it works
Use cases for AI/ML workloads
A real example using NVIDIA GPUs
Updated YAML with DeviceClass, ResourceClaim, and resourceClaims
Before/After comparison
A quick FAQ

What is Dynamic Resource Allocation (DRA)?

DRA is a Kubernetes feature that allows workloads to request non-CPU/memory resources dynamically, using a plugin-based architecture. Examples include:

GPUs
Smart NICs
NVMe SSDs
FPGAs
Inference accelerators

Traditionally, these were provisioned statically, often leading to poor resource utilization or complicated scheduling.

DRA enables:
✅ On-demand allocation at scheduling time
✅ Plugin-based orchestration
✅ Cleanup and deallocation when pods terminate
✅ Better resource utilization and isolation

This is done via a plugin interface and a few new Kubernetes objects:

DeviceClass — Declares the type of device and scheduling criteria
ResourceClaim — A claim on a resource from a DeviceClass
ResourceClaimTemplate — Used by workloads to request resource claims automatically
resourceClaims[] — Pod field that binds to one or more ResourceClaim
ResourceSlice — Represents available resources managed by a plugin

Before & After: Static GPU Allocation vs DRA

Before: Static Allocation (Pre-DRA)

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-static
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"

Requires node pre-configuration
GPU is reserved even if idle
No dynamic lifecycle or cleanup

After: Dynamic Allocation with DRA (Kubernetes 1.34+)

# DeviceClass definition
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  # CEL expression matched against devices published in ResourceSlices.
  # The driver name is an assumption; use the one your DRA driver reports.
  - cel:
      expression: device.driver == "gpu.nvidia.com"
---
# ResourceClaim: ask for exactly one device from the class
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: nvidia-gpu
        allocationMode: ExactCount
        count: 1
---
# Pod definition
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-dra
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu          # refers to the pod-level entry below
  resourceClaims:
  - name: gpu
    resourceClaimName: gpu-claim
  restartPolicy: Never

Resources provisioned at scheduling time
Plugin handles setup, teardown, isolation
Clean and flexible
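
If the lifecycle is easier to follow as code, here is a toy model of it: devices are advertised, a claim is matched to a free device when the pod is scheduled, and the device is released when the pod goes away. Everything in it is hypothetical (class names, the gpu.nvidia.com driver string, the node names). It mirrors the idea only, not the scheduler's or any driver's real implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    name: str
    driver: str
    node: str
    allocated_to: Optional[str] = None  # name of the claim holding the device


class ToyAllocator:
    """Matches claims to free devices, loosely mimicking DRA allocation."""

    def __init__(self, devices: list[Device]):
        self.devices = devices

    def allocate(self, claim: str, driver: str) -> Optional[Device]:
        for device in self.devices:
            if device.driver == driver and device.allocated_to is None:
                device.allocated_to = claim
                return device
        return None  # no free device: the pod would stay Pending

    def release(self, claim: str) -> None:
        for device in self.devices:
            if device.allocated_to == claim:
                device.allocated_to = None


pool = ToyAllocator([Device("gpu-0", "gpu.nvidia.com", "node-a")])
first = pool.allocate("gpu-claim", "gpu.nvidia.com")
second = pool.allocate("other-claim", "gpu.nvidia.com")
print(first.name, first.node, second)
pool.release("gpu-claim")
print(pool.allocate("other-claim", "gpu.nvidia.com").name)
```

The second claim only succeeds after the first one is released, which is the behavior that separates a claim-based model from a static per-node reservation.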

Benefits of Using DRA

Benefit	Description
🔄 Dynamic provisioning	Devices are allocated only when needed
📉 Efficiency	No wasted static reservations
🔐 Better isolation	Plugin manages lifecycle and security
📦 Clean YAML	Declarative hardware requests
⚙️ Plugin extensibility	Support for many device types (e.g. GPUs, FPGAs)

How to Use DRA in Your Cluster

Upgrade your cluster to Kubernetes 1.34+
Install a DRA-compatible plugin
Example: NVIDIA DRA driver
Define your DeviceClass
Create ResourceClaims or use templates
Reference them in your pods with resourceClaims[]

Pro Tips

Use ResourceClaimTemplate to auto-create claims per pod
Use CEL filters in DeviceClass for attribute-based scheduling
Monitor claim status to detect allocation failures
Match on the device attributes your driver publishes in ResourceSlices to ensure compatible hardware

FAQ: Kubernetes DRA

Q: Do I need to install anything to use DRA?
A: Yes — a compatible DRA plugin (e.g. NVIDIA, CXL, etc.)

Q: Is DRA stable in Kubernetes 1.34?
A: Yes — it is GA as of Kubernetes 1.34 (August 2025)

Q: Can I use DRA for memory or CPU?
A: No — DRA is specifically for non-CPU/memory resources

Q: What if the plugin crashes or fails to allocate?
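A: The pod stays Pending until its claim can be allocated, and the scheduler keeps retrying. Check the ResourceClaim status and the pod's events to see why allocation failed. Any fallback, such as targeting a different device class, has to be defined by you.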

Q: What replaces the NVIDIA device plugin?
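A: For devices requested through DRA, the DRA driver takes over the allocation role that the device plugin plays for nvidia.com/gpu requests. Whether both can run side by side depends on the driver version, so check the NVIDIA DRA driver documentation before removing the device plugin.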

Final Thoughts

DRA is one of the most significant enhancements in Kubernetes scheduling in years — especially for AI/ML, HPC, or hybrid workloads that need specialized hardware.

If you’re using Kubernetes 1.34, it’s time to start:
✅ Testing DRA
✅ Installing plugins
✅ Modernizing your GPU/NIC/storage allocation flows

Let me know if you’re using DRA already or planning to — I’d love to hear how it’s working for your team!

Kubernetes 1.33 vs 1.34: What’s New, What Changed, and Why It Matters

Rui Coelho — Tue, 18 Nov 2025 17:10:45 GMT

Introduction

The Kubernetes release cycle continues to deliver powerful improvements in performance, security, and resource orchestration. With Kubernetes 1.34 released in August 2025, it’s a good time to compare it to 1.33 (released in April) and understand what’s new, what’s changed, and how these updates impact real-world DevOps and cloud-native environments.

This article breaks down the key changes between versions 1.33 and 1.34, with a focus on practical benefits, feature maturity, and what you should consider before upgrading.

At a Glance: Release Timeline

Kubernetes 1.33 “Octarine” — Released April 23, 2025
Kubernetes 1.34 “Of Wind & Will” — Released August 27, 2025

Key Feature Comparisons

Feature	Kubernetes 1.33	Kubernetes 1.34	Why it matters
Dynamic Resource Allocation	Beta	GA (Stable)	GPU/NIC resources can now be orchestrated dynamically, ideal for ML/AI workloads.
In-Place Pod Resize	Beta	Improved Beta	No need to restart pods to adjust CPU/memory. Reduced downtime.
CEL Mutating Admission Policies	Alpha	Beta	Declarative admission control within API server; no external webhooks needed.
Native Sidecar Containers	GA	Stable	Cleaner lifecycle control for service mesh/logging sidecars.
Streaming Informers	Not available	Alpha	Less memory usage and better responsiveness in high-load clusters.

Dynamic Resource Allocation (DRA)

Dynamic Resource Allocation allows workloads to request and manage specialized resources like GPUs, FPGAs, and network devices dynamically. With version 1.34, DRA is now stable, making it production-ready for clusters with high-performance compute needs.

Why it matters:
If you’re deploying ML/AI workloads or working with hardware accelerators, Kubernetes now supports dynamic orchestration natively.

In-Place Pod Resource Resize

Kubernetes now allows you to resize CPU and memory allocations for running Pods without recreating them. Version 1.34 enhances this by supporting downscaling and expanding support for Pod-level resources.

Why it matters:
This reduces downtime and offers greater flexibility in autoscaling and operational tuning.

CEL Mutating Admission Policies

Version 1.34 promotes Mutating Admission Policies to beta. They use CEL (Common Expression Language) to modify API objects without external webhooks.

Why it matters:
This is a big step toward declarative, low-latency admission controls within the API server, reducing complexity and latency.

Native Sidecar Containers

Sidecars became a first-class citizen in 1.33, and 1.34 builds on that foundation. This change improves container lifecycle management and makes it easier to integrate service meshes and logging agents.

Why it matters:
Developers no longer need workarounds for container lifecycle synchronization.

Streaming Informers for Better Observability

Streaming informers allow high-throughput systems to stream changes from the API server without excessive memory usage, particularly for large LIST operations.

Why it matters:
Improves cluster stability under heavy load, and simplifies building responsive operator logic.

Security and Authentication Updates

Feature	1.33	1.34
User namespaces	Enabled by default	Continued
Service account token improvements	✅	✅
Mutual TLS (mTLS) for Pod-to-Pod auth	❌	Alpha in 1.34
Structured client certs	❌	✅

Why it matters:
Security improvements are gradual but meaningful. Mutual TLS between Pods is a major step forward for zero-trust cluster designs.

⚠️ Deprecations & Migration Considerations

1.33 introduced some deprecated API fields, especially around token and namespace handling.
1.34 is a safer upgrade, with fewer breaking changes — but you should still test against deprecated API usage and custom controllers.

Tip: Use the kubectl deprecations plugin (kubepug, installable through Krew) or other static analysis tools to scan your manifests before upgrading.
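
To give a feel for what such a scan does, here is a minimal sketch that checks manifest text for a few apiVersion/kind pairs removed in earlier Kubernetes releases. The list is deliberately tiny, the function name `findRemovedApis` is hypothetical, and the regex-based parsing is an assumption made to keep the example dependency-free; a dedicated scanner tracks the full, current list.

```javascript
// A tiny, illustrative list of API versions removed in past Kubernetes releases.
// Not exhaustive; real scanners ship a maintained database.
const REMOVED_APIS = [
  { apiVersion: 'extensions/v1beta1', kind: 'Ingress', removedIn: '1.22' },
  { apiVersion: 'batch/v1beta1', kind: 'CronJob', removedIn: '1.25' },
  { apiVersion: 'policy/v1beta1', kind: 'PodSecurityPolicy', removedIn: '1.25' },
];

// Scans multi-document YAML text with simple regexes (no YAML parser),
// so it only understands top-level "apiVersion:" and "kind:" lines.
function findRemovedApis(manifestText) {
  const findings = [];
  const docs = manifestText.split(/^---\s*$/m);
  docs.forEach((doc, index) => {
    const apiVersion = (doc.match(/^apiVersion:\s*["']?([^\s"']+)/m) || [])[1];
    const kind = (doc.match(/^kind:\s*["']?([^\s"']+)/m) || [])[1];
    const hit = REMOVED_APIS.find(
      (api) => api.apiVersion === apiVersion && api.kind === kind
    );
    if (hit) {
      findings.push({ document: index, ...hit });
    }
  });
  return findings;
}

// Two sample documents: one removed API and one current API.
const sample = [
  'apiVersion: batch/v1beta1',
  'kind: CronJob',
  '---',
  'apiVersion: apps/v1',
  'kind: Deployment',
].join('\n');

console.log(findRemovedApis(sample));
```

Only the CronJob document is reported here. Running something like this in CI before an upgrade catches the obvious cases early, but it is no substitute for testing against a real cluster.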

Should You Upgrade to Kubernetes 1.34?

✅ Yes, if:

You rely on specialized hardware (GPU, NICs, etc.)
You want to reduce resource management downtime
You need better control over admission and policy enforcement
You’re preparing your clusters for security hardening

❌ Maybe not yet, if:

You rely on 3rd-party admission controllers not yet compatible with CEL policies
Your environment is still catching up with 1.32 or earlier
You avoid alpha features and prefer longer stabilization cycles

Final Thoughts

Kubernetes 1.34 builds smartly on top of 1.33, delivering improved operational control, resource efficiency, and extensibility. While it’s not a revolutionary release, it brings several practical, production-focused enhancements that make it worth the upgrade — especially for teams focused on AI/ML workloads, observability, and multi-tenant security.


Sneak Peek into Kubernetes v1.34: What’s Coming and Why It Matters

Rui Coelho — Tue, 18 Nov 2025 17:00:08 GMT

Kubernetes continues its steady evolution, and with the upcoming v1.34 release, there are some exciting enhancements on the horizon. Scheduled for release in late August 2025, this version focuses on observability, smarter resource handling, enhanced scheduling, and developer experience improvements.

Here’s a preview of the most impactful features coming to Kubernetes v1.34 — based on the official sneak peek — along with concrete examples and thoughts on what this means for platform teams.

Dynamic Resource Allocation (DRA) Goes Stable

One of the biggest highlights: Dynamic Resource Allocation (DRA) is going GA (Generally Available).

DRA allows Kubernetes to schedule Pods that require dynamic or external resources (like GPUs, FPGAs, or licensed software) by coordinating with resource drivers. This opens the door for smarter, safer scheduling without race conditions or pre-binding issues.

Example Use Case

apiVersion: v1
kind: Pod
metadata:
  name: gpu-task
spec:
  containers:
  - name: compute
    image: nvidia/cuda:11.0-base
    resources:
      claims:
      - name: my-gpu
  resourceClaims:
  - name: my-gpu
    # Assumes a ResourceClaimTemplate named "my-gpu-template" (hypothetical) exists
    resourceClaimTemplateName: my-gpu-template

With DRA, Kubernetes dynamically allocates the required GPU from a compatible pool — improving isolation and usage efficiency.

ServiceAccount Tokens for Image Pulls (Beta)

Previously, long-lived imagePullSecrets were used to authenticate against container registries. Kubernetes v1.34 introduces automatic short-lived tokens derived from ServiceAccounts, allowing the kubelet to securely authenticate image pulls using projected tokens.

This change reduces token leakage risks and simplifies rotation.

Key Benefit

You no longer need to manually create and mount secrets for private registry authentication — Kubernetes handles it for you securely and automatically.

Pod Replacement Policy (Alpha)

In complex deployments, replacing Pods too early (while the old one is still terminating) can cause unexpected issues like conflicting ports, database reconnects, or resource contention.

Kubernetes v1.34 introduces a new field:

spec:
  podReplacementPolicy: TerminationStarted  # or TerminationComplete

This allows developers to control when a new Pod should be scheduled during rolling updates.

Example Use Case

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
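spec:
  replicas: 3
  # podReplacementPolicy is a Deployment-level field, not part of the Pod template
  podReplacementPolicy: TerminationComplete
  strategy:
    type: RollingUpdate
  # selector and Pod template omitted for brevity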

With TerminationComplete, Kubernetes waits for the old Pod to fully shut down before starting a replacement — perfect for stateful or resource-heavy applications.

PreferSameNode and PreferSameZone (Beta)

Load balancing just got smarter. Instead of relying on externalTrafficPolicy: Local and hoping for the best, Kubernetes now introduces more precise topology preferences in routing.

Setting a Service's spec.trafficDistribution field to PreferSameZone or PreferSameNode tells the data plane to prefer endpoints in the caller's zone or on the caller's node, improving latency and network performance.

Why It Matters

In multi-zone clusters or edge deployments, this routing optimization can drastically reduce cross-zone traffic — reducing cost and improving user experience.

KYAML Output Format

Here’s where KYAML truly shines. In my Medium article, “KYAML in Kubernetes v1.34: A Safer, Leaner Alternative to YAML and JSON”, I explore why KYAML is designed to overcome YAML’s quirks — like whitespace sensitivity, implicit type coercion (hello, “Norway Bug”!) — and offer a cleaner, more predictable serialization format.

You can now use (the format is alpha in v1.34, so set KUBECTL_KYAML=true first):

kubectl get deployment api-server -o kyaml

Expect more consistent diffs, better tooling support, and fewer surprises.

Observability with Tracing (Beta/Stable)

Kubernetes v1.34 embraces OpenTelemetry-based tracing in both the API Server and kubelet. Now you get visibility into:

gRPC communications between kubelet and container runtimes
Admission control chains
API request lifecycle tracing

This enhanced observability is invaluable for diagnosing control plane latency or debugging performance regressions.

Bonus Highlights (v1.34 Release — Of Wind and Will)

Additionally, the final v1.34 release includes:

PSI Metrics (Beta) — better insights into CPU & memory pressure.
Node Swap Support (GA) — smoother memory management, fewer OOMs.
CPUManager Uncore Cache Alignment (Beta) — improved performance for NUMA-aware workloads.
kuberc (User Preferences) — customize kubectl defaults and output.

Why It Matters for Platform Teams

This release is operator-focused, tackling real-world challenges:

Secure, lightweight image pulls
Fine control over rollout timing
Richer observability (tracing and PSI)
Smarter, topology-aware routing
Cleaner, safer configuration via KYAML

What to Try Next

Enable DRA for GPU or other specialized hardware.
Switch to ServiceAccount token pulls — more secure and simpler.
Experiment with Pod replacement policies in canary/rolling deployments.
Activate tracing in staging — boost observability.
Start using KYAML and see the cleaner diffs. I dive into this in my Medium article!

In Summary

Kubernetes v1.34 is a quietly powerful release — improving the developer and operator experience without flashy headlines. From KYAML to tracing, resource efficiency to rollout control, it’s a substantial step forward.

Check out my Medium article for a deep dive into KYAML, and let me know what features you’re exploring.

KYAML in Kubernetes v1.34: A Safer, Leaner Alternative to YAML and JSON

Rui Coelho — Tue, 18 Nov 2025 15:49:58 GMT

Introduction

In the cloud-native ecosystem, YAML and JSON have been the de facto formats for writing Kubernetes manifests and configuration files. But both come with trade-offs. As of Kubernetes v1.34, a new configuration dialect — KYAML — emerges to bridge the gap: safer than YAML, more flexible than JSON, and fully compatible with existing tooling.

The Case Against Traditional YAML and JSON

YAML: Whitespace Hazards & Implicit Typing

YAML’s human-friendly syntax often hides pitfalls:

Whitespace sensitivity: A single misplaced space can restructure a manifest unexpectedly — a nightmare in templating systems like Helm.
Implicit typing: Unquoted values like NO, yes, or 1:23 can be coerced into boolean or numeric types—famously known as the "Norway Bug." For example, country: NO may inadvertently become false. The StrictYAML community removed implicit typing precisely to avoid that kind of ambiguity. The HitchDev blog discusses this problem in depth under the name “The Norway Problem”.
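
To make the Norway problem concrete, here is a small sketch that mimics how a YAML 1.1 resolver treats unquoted scalars. It models only the boolean forms listed in the YAML 1.1 type specification; `resolvePlainScalar` is a hypothetical helper of mine, not a function from any YAML library, and real parsers differ in which of these forms they honor.

```javascript
// Plain scalars that the YAML 1.1 boolean type resolves to true or false.
const YAML11_TRUE = /^(y|Y|yes|Yes|YES|true|True|TRUE|on|On|ON)$/;
const YAML11_FALSE = /^(n|N|no|No|NO|false|False|FALSE|off|Off|OFF)$/;

// Mimics how a YAML 1.1 resolver treats an unquoted scalar.
// Only booleans are modelled; numbers, nulls and timestamps are ignored here.
function resolvePlainScalar(text) {
  if (YAML11_TRUE.test(text)) return true;
  if (YAML11_FALSE.test(text)) return false;
  return text; // everything else stays a string in this sketch
}

// Country codes for Portugal, Sweden and Norway.
for (const code of ['PT', 'SE', 'NO']) {
  console.log(code, '=>', resolvePlainScalar(code));
}
```

PT and SE come back as strings, while NO comes back as the boolean false. That is exactly the kind of silent change that quoting every string is meant to prevent.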

JSON: Valid, But Minus the Human Touch

JSON is stricter and more predictable — no implicit types or indentation issues — but it lacks comments, forbids trailing commas, and mandates quoted keys, making it less practical for human-authored manifest files.

Enter KYAML: The Best of Both Worlds

As revealed in the Kubernetes v1.34 Sneak Peek (July 28, 2025), KYAML is a Kubernetes-specific dialect of YAML designed to eliminate common configuration errors while remaining compatible with existing tooling.

Key features of KYAML include:

All string values are double‑quoted, avoiding implicit coercion (e.g., "NO" remains a string);
Flow-style syntax: Always uses {} for mappings and [] for lists, which greatly reduces whitespace sensitivity;
Comments are allowed, retaining readability and documentation ability — something JSON forbids;
Trailing commas are permitted, enabling cleaner edits and diffs;
Unquoted keys by default, unless ambiguous, preserving clarity and brevity.

As a strict subset of YAML, KYAML ensures all valid KYAML is still valid YAML — so existing parsers and tools continue to work seamlessly.
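
The rules above are simple enough to sketch in a few lines. The following is a rough, KYAML-like serializer for plain JSON-compatible data; it is not the implementation kubectl uses, and both the `toKyamlLike` name and the key-quoting rule are my own simplifications for illustration.

```javascript
// Keys that look like plain identifiers and cannot be mistaken for
// booleans or nulls are left unquoted; everything else is quoted.
const AMBIGUOUS_KEY = /^(y|n|yes|no|on|off|true|false|null|~)$/i;
const SAFE_KEY = /^[A-Za-z_][A-Za-z0-9_.\/-]*$/;

function formatKey(key) {
  return SAFE_KEY.test(key) && !AMBIGUOUS_KEY.test(key) ? key : JSON.stringify(key);
}

// Serializes plain JSON-compatible data in a KYAML-like style:
// flow mappings and sequences, double-quoted strings, trailing commas.
function toKyamlLike(value, indent = 0) {
  const pad = '  '.repeat(indent);
  const inner = '  '.repeat(indent + 1);
  if (typeof value === 'string') return JSON.stringify(value);
  if (value === null || typeof value !== 'object') return String(value);
  if (Array.isArray(value)) {
    if (value.length === 0) return '[]';
    const items = value.map((item) => `${inner}${toKyamlLike(item, indent + 1)},`);
    return `[\n${items.join('\n')}\n${pad}]`;
  }
  const keys = Object.keys(value);
  if (keys.length === 0) return '{}';
  const entries = keys.map(
    (key) => `${inner}${formatKey(key)}: ${toKyamlLike(value[key], indent + 1)},`
  );
  return `{\n${entries.join('\n')}\n${pad}}`;
}

console.log(toKyamlLike({ data: { country: 'NO', replicas: 3, no: 'quoted key' } }));
```

In the printed result the string "NO" stays quoted, the number 3 stays bare, and the ambiguous key no is quoted while country and replicas are not. Because the output uses only flow-style YAML, any YAML parser should be able to read it back.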

KYAML in Action: Kubernetes v1.34 (Alpha)

In Kubernetes v1.34 (released August 27, 2025), KYAML is introduced as an alpha feature, meaning it’s optional — but available for experimentation.

You can enable KYAML output in kubectl using:

export KUBECTL_KYAML=true
kubectl get <resource> -o kyaml

All existing YAML and JSON output formats remain supported.

Why KYAML Matters

KYAML isn’t just syntactic sugar — it addresses real pain points:

Challenge	YAML	JSON	KYAML (v1.34)
Type coercion	Implicit, often surprising	None	Strings always quoted
Indentation issues	Very sensitive	None	Flow-style avoids indentation reliance
Comments	Supported	Not supported	Supported
Trailing commas	Allowed in flow style only	Not allowed	Allowed
Tooling compatibility	Broad	Broad	Fully compatible with YAML tools

Early adopters report KYAML can reduce deployment errors, especially in GitOps workflows, audit processes, and CI/CD contexts.

The Road Ahead: KEP‑5295 and Community Feedback

KYAML is formalized under KEP‑5295, introduced by SIG CLI. The proposal includes:

KYAML as a new kubectl output format.
Plans to eventually make KYAML the standard format for Kubernetes documentation and examples.

Community reactions are mixed. On Reddit, some users praise KYAML for retaining compatibility while reducing ambiguity, while others feel it’s “uglier” or slower to type due to braces and quotes.

Sample Comparison

Traditional YAML (error-prone):

apiVersion: v1
kind: ConfigMap
data:
  country: NO
  version: 1.0

Possible pitfalls: country becomes boolean false, floats inferred, and indentation errors risk breakage.

KYAML Equivalent:

---
{
  apiVersion: "v1",
  kind: "ConfigMap",
  data: {
    country: "NO",
    version: "1.0",
  },
}

Clear types, predictable structure — comments allowed, trailing comma included.

Real-World Comparison: YAML vs KYAML in Action

Let’s compare the actual output of a real Kubernetes object — the kubernetes service running in the default namespace—retrieved via kubectl.

YAML Output

kubectl get svc kubernetes -o yaml

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2025-09-06T12:12:51Z"
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
  resourceVersion: "243"
  uid: d1f8264c-60a1-418f-bc69-511aec01691a
spec:
  clusterIP: 172.20.0.1
  clusterIPs:
  - 172.20.0.1
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 6443
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

This format is readable but carries all the risks of whitespace sensitivity, lack of quoting, and potential type coercion (e.g., None, Cluster, or IPs could be parsed differently in various YAML parsers).

KYAML Output

kubectl get svc kubernetes -o kyaml

---
{
  apiVersion: "v1",
  kind: "Service",
  metadata: {
    creationTimestamp: "2025-09-06T12:12:51Z",
    labels: {
      component: "apiserver",
      provider: "kubernetes",
    },
    name: "kubernetes",
    namespace: "default",
    resourceVersion: "243",
    uid: "d1f8264c-60a1-418f-bc69-511aec01691a",
  },
  spec: {
    clusterIP: "172.20.0.1",
    clusterIPs: [
      "172.20.0.1",
    ],
    internalTrafficPolicy: "Cluster",
    ipFamilies: [
      "IPv4",
    ],
    ipFamilyPolicy: "SingleStack",
    ports: [{
      name: "https",
      port: 443,
      protocol: "TCP",
      targetPort: 6443,
    }],
    sessionAffinity: "None",
    type: "ClusterIP",
  },
  status: {
    loadBalancer: {},
  },
}

Here we clearly see KYAML’s strengths:

All strings are explicitly quoted
Flow-style syntax makes the structure explicit
Comments are allowed (not shown here, but supported)
Trailing commas allowed for clean diffs

Even a subtle misinterpretation like None being treated as a Python-style null (instead of a string) is avoided thanks to strict quoting.

Conclusion: Do We Really Need a YAML-JSON Hybrid?

KYAML — introduced in Kubernetes v1.34 under KEP‑5295 — is clearly an attempt to bring predictability and structure to the sometimes frustrating world of YAML-based configuration.

It solves real problems: it removes implicit typing, supports trailing commas and comments, and avoids whitespace-related bugs. It’s fully backwards compatible with YAML tooling and enables safer GitOps workflows. On paper, it’s an elegant step forward.

But speaking honestly… I’m still not sure if I like it.

At first glance, KYAML looks a lot like JSON, especially due to its flow-style syntax with {} and []. That resemblance can be disorienting, especially when you’re expecting a YAML file. It introduces more visual noise—quotes, commas, and brackets—which can make simple configurations feel bloated compared to traditional YAML.

Do we really need a hybrid between YAML and JSON?
Or are we just adding another format to an already overloaded toolchain?

It feels like a compromise between two imperfect formats — trying to be safer than YAML while remaining more user-friendly than JSON. But compromises can sometimes bring new complexity rather than clarity.

That said, it’s still early days. KYAML is in alpha, and its adoption will likely depend on how tooling, IDEs, and human workflows evolve around it. If it gains traction, KYAML could become the default in Kubernetes’ future — but whether it becomes loved is another matter.

Bonus: KYAML Meme of the Month

Spotted on Twitter recently:

And… well, they’re not wrong.
KYAML definitely looks like that kind of hybrid at first sight. 🐧➕🐘 = 🤯

Streamlining Communication: Sending Webhook Messages from Heroku to Discord

Rui Coelho — Tue, 18 Nov 2025 14:52:42 GMT

In this article, we delve into the seamless integration of Heroku, a popular cloud platform, with Discord, a widely used communication platform. The focus is on sending webhook messages from Heroku to Discord, enhancing real-time updates and notifications for various applications. The article guides readers through the process, offering a step-by-step approach to set up and configure webhooks, ensuring efficient and automated communication between Heroku-hosted applications and Discord channels. With this integration, users can stay informed, monitor events, and streamline communication effortlessly.

To connect Heroku to Discord, we need a small application that converts Heroku webhook messages into the format Discord expects. The app has the following parts:

Features

1. Express.js Server Setup:
— Utilizes the Express.js framework to create a web server.
— Listens on the specified port (process.env.PORT or defaulting to 3000).

2. Heroku Webhook Handling:
— Provides a route (/heroku-webhook) to handle incoming Heroku webhook events.
— Extracts relevant data from the Heroku webhook payload, including event data and metadata.
— Formats a message based on the extracted data to provide a concise summary of the Heroku event.

3. Discord Webhook Integration:
— Utilizes the Axios library to send a formatted payload to a Discord webhook.
— The Discord payload includes an embed with information about the Heroku event.

4. Environment Variable Usage:
— Uses process.env to read the PORT and DISCORD_WEBHOOK environment variables.
— Provides a default value for PORT (3000) and logs the DISCORD_WEBHOOK value.

You can find the code I used on my GitHub Repository.

Let’s code

We will need some dependencies to create our application. Let’s install them using the following commands:

npm init
npm install axios@1.6.5
npm install body-parser@1.20.2
npm install express@4.18.2

You should be able to obtain a package.json file similar to mine:

{
  "name": "heroku-discord",
  "version": "1.0.0",
  "description": "Parser for heroku webhook messages to discord webhook messages",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "repository": {
    "type": "git",
    "url": "git+https://github.com/WebCDC/heroku-discord.git"
  },
  "author": "Rui Coelho",
  "license": "ISC",
  "bugs": {
    "url": "https://github.com/WebCDC/heroku-discord/issues"
  },
  "homepage": "https://github.com/WebCDC/heroku-discord#readme",
  "dependencies": {
    "axios": "^1.6.5",
    "body-parser": "^1.20.2",
    "express": "^4.18.2"
  }
}

Let’s create an index.js file:

const express = require('express');
const bodyParser = require('body-parser');
const axios = require('axios');

const app = express();
const port = process.env.PORT || 3000;
const webhook = process.env.DISCORD_WEBHOOK;

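// Logs the webhook URL at startup for debugging; the URL is a secret, so remove this in production
console.log(webhook);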

app.use(bodyParser.json());

app.post('/heroku-webhook', (req, res) => {
  // Extract relevant data from Heroku webhook payload
  const herokuEventData = req.body.data;
  const herokuEventMetadata = req.body.webhook_metadata;

  // Summarize the event in a few lines for the embed description
  const message = `Event ID => ${herokuEventMetadata.event.id}\nEvent Type => ${herokuEventMetadata.event.include}\nTriggered by => ${herokuEventData.user.email}\nStatus => ${herokuEventData.status}`;


  // Transform data to match Discord webhook payload structure
  const discordPayload = {
    embeds: [
      {
        title: 'Heroku Notification',
        description: message,
        fields: [
          { name: 'App Name', value: herokuEventData.app.name, inline: true },
          // Add more fields as needed
        ],
      },
    ],
  };

  // Send payload to Discord webhook
  axios.post(`${webhook}`, discordPayload)
    .then(() => res.sendStatus(200))
    .catch(error => {
      console.error('Error sending Discord webhook:', error);
      res.sendStatus(500);
    });
});

app.listen(port, () => {
  console.log(`Server is running on port ${port}`);
});

Let’s break this code down.

Dependencies

express: A web application framework for Node.js.
body-parser: Middleware that parses incoming request bodies before your handlers run.
axios: A promise-based HTTP client for the browser and Node.js.

const express = require('express');
const bodyParser = require('body-parser');
const axios = require('axios');

Application Setup

Creates an instance of the Express application.
Defines the port for the server to run on (using the environment variable PORT or defaulting to 3000).
Retrieves the Discord webhook URL from the environment variable DISCORD_WEBHOOK.

const app = express();
const port = process.env.PORT || 3000;
const webhook = process.env.DISCORD_WEBHOOK;

Middleware Setup

Uses body-parser middleware to parse incoming JSON requests.

app.use(bodyParser.json());

Webhook Handling

Defines a route (/heroku-webhook) to handle incoming POST requests from Heroku.
Extracts relevant data from the Heroku webhook payload.
Formats a message using the extracted data.
Creates a Discord payload with an embed object containing the formatted message and additional information.
Sends the payload to the specified Discord webhook.

app.post('/heroku-webhook', (req, res) => {
  // ... (Handling Heroku webhook data)

  // Transform data to match Discord webhook payload structure
  const discordPayload = {
    embeds: [
      {
        title: 'Heroku Notification',
        description: message,
        fields: [
          { name: 'App Name', value: herokuEventData.app.name, inline: true },
          // Add more fields as needed
        ],
      },
    ],
  };

  // Send payload to Discord webhook
  axios.post(`${webhook}`, discordPayload)
    .then(() => res.sendStatus(200))
    .catch(error => {
      console.error('Error sending Discord webhook:', error);
      res.sendStatus(500);
    });
});
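
One thing the handler above does not cover is a payload with missing fields: if user or app is absent, the property access throws before anything is sent to Discord. A defensive variant could look like the sketch below. The `buildDiscordPayload` helper and the 'unknown' fallback are my own assumptions, and the sample values (my-app, abc-123, api:build) are placeholders.

```javascript
// Builds the Discord payload from a Heroku webhook body without throwing
// when optional parts (user, app, metadata) are absent.
function buildDiscordPayload(body) {
  const data = body?.data ?? {};
  const event = body?.webhook_metadata?.event ?? {};
  const lines = [
    `Event ID => ${event.id ?? 'unknown'}`,
    `Event Type => ${event.include ?? 'unknown'}`,
    `Triggered by => ${data.user?.email ?? 'unknown'}`,
    `Status => ${data.status ?? 'unknown'}`,
  ];
  return {
    embeds: [
      {
        title: 'Heroku Notification',
        description: lines.join('\n'),
        fields: [
          { name: 'App Name', value: data.app?.name ?? 'unknown', inline: true },
        ],
      },
    ],
  };
}

// A trimmed payload with no "user" object, as an example of missing data.
const payload = buildDiscordPayload({
  data: { app: { name: 'my-app' }, status: 'succeeded' },
  webhook_metadata: { event: { id: 'abc-123', include: 'api:build' } },
});
console.log(JSON.stringify(payload, null, 2));
```

Keeping the transformation in a pure function also makes it easy to unit test without starting the Express server or calling Discord.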

Server Initialization

Starts the server and listens on the specified port.

app.listen(port, () => {
  console.log(`Server is running on port ${port}`);
});

Get Discord WebHook Token

To get a webhook for Discord, go to Server Settings > Integrations > Webhooks, then create a webhook and copy the provided URL for message posting.

Executing the project

Now we need to define our webhook as an environment variable. This can be done with dotenv; here I use the export command on Linux:

export DISCORD_WEBHOOK=https://discord.com/api/webhooks/...

Now you just need to execute your application:

node index.js

https://discord.com/api/webhooks/...
Server is running on port 3000
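
Note that the output above prints the full webhook URL, which anyone can use to post to your channel. As a possible refinement, the sketch below validates the variable at startup and logs only a masked form. The `requireWebhookUrl` and `maskWebhookUrl` helpers are hypothetical names of mine, and the example URL is a placeholder rather than a real webhook.

```javascript
// Returns a parsed URL for the webhook or throws a descriptive error.
function requireWebhookUrl(env) {
  const raw = env.DISCORD_WEBHOOK;
  if (!raw) {
    throw new Error('DISCORD_WEBHOOK is not set');
  }
  const url = new URL(raw); // throws a TypeError on malformed input
  if (url.protocol !== 'https:') {
    throw new Error('DISCORD_WEBHOOK must be an https URL');
  }
  return url;
}

// Hides the path (which contains the webhook id and token) for logging.
function maskWebhookUrl(url) {
  return `${url.origin}/***`;
}

// Placeholder value for illustration only; not a real webhook.
const exampleEnv = { DISCORD_WEBHOOK: 'https://discord.com/api/webhooks/123/example-token' };
console.log('Using webhook', maskWebhookUrl(requireWebhookUrl(exampleEnv)));
```

In the real application you would pass process.env instead of exampleEnv, so a missing or malformed variable stops the server at startup instead of failing on the first event.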

Your application will need public access, and you can deploy it using Heroku as well.

Set webhook on Heroku

In your project (the app whose events you want to forward), go to More -> View Webhooks -> Create Webhook.

Configure your webhook settings and save them.

Note: The webhook URL should be the public address of the application we just built, including the /heroku-webhook path.

Test notifications

Now you can initiate a build for the application where you want to utilize the webhook. In your Discord channel, you should receive a notification similar to the following:

Conclusion

The integration of webhook messages from Heroku to Discord offers a powerful solution for streamlining communication in your development workflow. By seamlessly connecting these platforms, you enhance real-time updates and notifications, fostering collaboration and efficiency among your team members. The straightforward implementation and flexibility of webhooks empower you to tailor the communication process to fit your specific needs, creating a more responsive and dynamic development environment. As technology continues to evolve, embracing tools like webhooks not only simplifies communication but also contributes to a more connected and agile development process. Ultimately, by leveraging the synergy between Heroku and Discord, you can propel your project forward with timely and relevant information, enhancing overall productivity and success.

You can find the code I used on my GitHub Repository.

GitHub Pages with custom DNS

Rui Coelho — Tue, 18 Nov 2025 14:47:14 GMT

GitHub Pages is a static site hosting service offered by GitHub. It allows users to host personal, organization, or project pages directly from a GitHub repository. This service is commonly used for hosting documentation, personal blogs, or simple websites.

Key Features

Static Site Hosting: GitHub Pages hosts static websites directly from a GitHub repository, making it easy to publish and maintain web content.
Jekyll Integration: It supports Jekyll, a popular static site generator, allowing for easy customization and theming of websites.
Custom Domain Support: GitHub Pages allows users to use a custom domain for their websites, providing flexibility in branding and hosting.
Version Control Integration: Since it is integrated with GitHub, version control and collaboration features are readily available for managing website content.
Free Hosting: GitHub Pages provides free hosting for static websites, making it an attractive option for personal and small-scale projects.

Getting started

To get started with GitHub Pages, you can create a new repository on GitHub and enable GitHub Pages in the repository settings. You can then customize your website using Jekyll or by simply pushing HTML, CSS, and JavaScript files to the repository.

Custom DNS

GitHub Pages allows you to use a custom domain for your websites, which means you can point a domain you own to your GitHub Pages site. This is often referred to as setting up custom DNS for GitHub Pages. By configuring the DNS settings for your domain, you can make it resolve to your GitHub Pages site, effectively associating your custom domain with your GitHub Pages website.

Create CNAME file

To create a CNAME file for your custom domain, follow these steps:

Create a new file in the root of your GitHub Pages repository.
Name the file “CNAME” (without any file extension).
Open the “CNAME” file and add your custom domain (e.g., example.com) as the content of the file.

You can consult an example at CNAME.

Create DNS CNAME record

Navigate to your DNS provider and create a CNAME record that points your subdomain to the default domain for your site. For example, if you want to use the subdomain www.example.com for your user site, create a CNAME record that points www.example.com to <user>.github.io.

If you want to use the subdomain another.example.com for your organization site, create a CNAME record that points another.example.com to <organization>.github.io. The CNAME record should always point to <user>.github.io or <organization>.github.io, excluding the repository name. For more information about how to create the correct record, see your DNS provider’s documentation. For more information about the default domain for your site, see “About GitHub Pages”.
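
Once the record exists, you can check it from Node.js. The sketch below compares the CNAME targets of a host against the expected github.io domain of the account; the helper names are hypothetical and "octocat" is a placeholder account name. Only the pure comparison runs in the example, since `checkCustomDomain` needs network access.

```javascript
const dns = require('dns').promises;

// Normalizes a DNS name for comparison: lower case, no trailing dot.
function normalizeHost(name) {
  return name.toLowerCase().replace(/\.$/, '');
}

// True when any resolved CNAME target equals "<owner>.github.io".
function pointsToGitHubPages(cnameTargets, owner) {
  const expected = normalizeHost(`${owner}.github.io`);
  return cnameTargets.some((target) => normalizeHost(target) === expected);
}

// Looks up the CNAME records for a host and compares them to the expected target.
async function checkCustomDomain(host, owner) {
  const targets = await dns.resolveCname(host);
  return pointsToGitHubPages(targets, owner);
}

// Pure check with sample data; no DNS query is made here.
console.log(pointsToGitHubPages(['Octocat.github.io.'], 'octocat'));
```

A target that includes the repository name would fail this check, which matches the rule that the record must exclude it. If the record is proxied by a CDN, the lookup may not return the CNAME at all, so treat a negative result as a hint and not as proof of a misconfiguration.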

Configure DNS with Cloudflare

If you are using Cloudflare as your DNS provider, follow these steps to configure your custom domain:

Log in to your Cloudflare account.
Navigate to your domain’s DNS settings.
Add a CNAME record with the name “www” and point it to <user>.github.io.
Ensure the CNAME record is proxied through Cloudflare to benefit from their security and performance features.

Configure GitHub Pages

To set up a www or custom subdomain, such as www.example.com or blog.example.com, you must add your domain in the repository settings. After that, configure a CNAME record with your DNS provider.

On GitHub, navigate to your site’s repository.

Under your repository name, click Settings. If you cannot see the “Settings” tab, select the dropdown menu, then click Settings.

Screenshot of a repository header showing the tabs. The “Settings” tab is highlighted by a dark orange outline.

In the “Code and automation” section of the sidebar, click Pages.
