Introduction
Hello, friends. Also, salute, bitter enemies.
Today, we’re talking about using the Delve debugger to debug a remote pod. Are you using Kubernetes? OpenShift? Something else? It doesn’t matter, homeslice, because what I’m describing today should be applicable to any container orchestration tool.
I’m going to simulate a real situation I faced at a past job, because I felt it would be helpful and illuminating to describe the issue and how it was debugged. This kind of remote debugging is, I’m sure, quite common, and what I will outline is just one way to do it. The point is why it matters: it shows how to gain access to, and debug, something that is not running locally on your workstation. Without these kinds of processes, fixing a bug is just guesswork. And we want to remove the guesswork and ambiguity and focus on core issues that are well understood.
The victim will be my go-simple-chat program, which is a multi-threaded chat messaging program that allows any number of clients to connect to a central server that manages all of the connections. This means that there need to be two open ports that will be port forwarded (one for the chat program and one for the debugger), but for your own example, it could be just one.
Let’s get started. We’ll begin with local debugging and ramp up to remote debugging.
Local Debugging
Now, let’s say that we need to debug this program. I know, since I programmed it, it’s hard to imagine that there would be anything wrong with it, but let’s pretend like we did when we were children.
There are two ways to debug a multi-threaded client/server paradigm program.
dlv-exec
Have Delve start the program as a child process:
$ dlv exec ./go-simple-chat
Type 'help' for list of commands.
(dlv) funcs main
main.(*Client).Broadcast
main.(*Client).Listen
main.(*Client).Start
main.(*Client).Start.gowrap1
main.(*Client).Start.gowrap2
main.(*Client).closeChannel
main.(*Client).closeChannel.func1
main.NewChat
main.NewClient
main.deregister
main.handleNewConnection
main.init
main.main
main.main.func1
main.main.func2
main.main.func2.gowrap1
main.register
main.shutdown
net.absDomainName
net.isDomainName
runtime.main
runtime.main.func1
runtime.main.func2
type:.eq.main.ChatMessage
type:.eq.main.Client
(dlv) b main.handleNewConnection
Breakpoint 1 set at 0x4f68b3 for main.handleNewConnection() ./main.go:43
(dlv) c
Simple chat server started and listening on :9999
> [Breakpoint 1] main.handleNewConnection() ./main.go:43 (hits goroutine(12):1 total:1) (PC: 0x4f68b3)
Warning: debugging optimized function
Warning: listing may not match stale executable
38: return &Chat{
39: c: make(ChatRoom),
40: broadcast: make(chan ChatMessage, broadcastBuffer),
41: mu: &sync.Mutex{},
42: }
=> 43: }
44:
45: func handleNewConnection(ctx context.Context, conn net.Conn) {
46: _, err := conn.Write([]byte("What's your name?: "))
47: if err != nil {
48: fmt.Printf("err=%+v\n", err)
(dlv)
The key here is to get past the part of the program that binds to the port so that clients can connect to it. That is why I listed all of the functions in the main package and selected one (main.handleNewConnection) that was past the point of binding and into the infinite loop where client connections are handled.
Then, open one or more terminals and start connecting clients. In the output above, you can see that the breakpoint was hit when I did just that.
dlv-attach
Attach Delve to an already-running program.
$ ps aux | ag [s]imple-chat
btoll 701153 0.0 0.0 1746624 11760 pts/4 Sl+ 14:11 0:00 ./go-simple-chat
$ dlv attach 701153
Type 'help' for list of commands.
(dlv) b main.NewClient
Breakpoint 1 set at 0x4f6d37 for main.NewClient() ./client.go:23
(dlv)
When running without debugging (i.e., not having Delve run the program), we can see that the state is Sl+, which means that the program:
S- interruptible sleep (waiting for an event to complete)
l- is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+- is in the foreground process group
As opposed to using Delve to launch the program:
$ ps aux | ag [s]imple-chat
btoll 724644 4.0 0.2 6497756 35852 pts/4 Sl+ 14:47 0:00 dlv exec ./go-simple-chat
btoll 724652 0.0 0.0 2932 200 pts/4 t 14:47 0:00 /home/btoll/projects/go-simple-chat/go-simple-chat
Here, we can see that dlv has the same process state codes as when the program is run directly. The difference is the process state of the go-simple-chat program itself, which was spawned as a child process of dlv:
t- stopped by debugger during the tracing
You can see that go-simple-chat is a child of dlv by using pstree with the PID of dlv (taken from the previous command output):
$ pstree -p 724644
dlv(724644)─┬─dlv(724651)─┬─{dlv}(724654)
│ ├─{dlv}(724655)
│ ├─{dlv}(724656)
│ ├─{dlv}(724657)
│ └─{dlv}(724658)
├─go-simple-chat(724652)
├─{dlv}(724645)
├─{dlv}(724646)
├─{dlv}(724647)
├─{dlv}(724648)
├─{dlv}(724649)
├─{dlv}(724653)
├─{dlv}(724659)
├─{dlv}(724660)
├─{dlv}(724661)
├─{dlv}(724662)
├─{dlv}(724663)
├─{dlv}(724664)
└─{dlv}(724665)
All process state code definitions are taken directly from the ps man page.
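If you'd rather not memorize the codes, a tiny lookup does the job. This is just a sketch covering the four codes that showed up in the ps output in this article (the ps man page lists more); describeStat and statCodes are names I made up for the example.

```go
package main

import "fmt"

// statCodes maps the ps(1) process state characters seen in this article
// to their man page descriptions. The man page defines more codes than
// these; anything else is reported as not covered.
var statCodes = map[rune]string{
	'S': "interruptible sleep (waiting for an event to complete)",
	't': "stopped by debugger during the tracing",
	'l': "is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)",
	'+': "is in the foreground process group",
}

// describeStat explains each character of a STAT column value such as "Sl+".
func describeStat(stat string) []string {
	var out []string
	for _, c := range stat {
		desc, ok := statCodes[c]
		if !ok {
			desc = "not covered in this sketch"
		}
		out = append(out, fmt.Sprintf("%c: %s", c, desc))
	}
	return out
}

func main() {
	// "Sl+" is what dlv and a directly run go-simple-chat showed;
	// "t" is go-simple-chat when it is being traced by dlv.
	for _, stat := range []string{"Sl+", "t"} {
		fmt.Println(stat)
		for _, line := range describeStat(stat) {
			fmt.Println("  " + line)
		}
	}
}
```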
Let’s run it in a container:
$ podman run --rm -p 9999:9999 btoll/go-simple-chat:latest
That’s really nifty. But, how do we debug when the programs that we’re interested in are running in a completely different network? How do we gain the visibility we need so we’re not guessing as to the root cause of the issue?
In my experience, this has occurred when I’ve been working with a container orchestration tool like Kubernetes or OpenShift.
To have full control and to avoid any potential side effects (and to avoid giving any money to grotesque humans like Jeff Bezos), it’s optimal to spin up a cluster locally. However, frequently there are impenetrable layers of infrastructure and code to comb through before the call can even be made that it’s feasible.
The next best thing is to create a cluster in a testing environment, usually in the cloud. I’ve debugged pods remotely that have been in OpenShift, EKS and on-premise clusters.
Remote Debugging
For doing this on-prem, as the cool kids say, or just locally, as normal people do, we’ll use a lightweight container orchestrator and a simple deployment.
Assembling The Parts
Or, as they say in Boston, assembling the pahts.
What do we need to assemble to have a passable replica of the system on our local machine? Here are the parts:
- Container orchestration.

- Deployment

  deployment.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: go-simple-chat
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: go-simple-chat
      template:
        metadata:
          labels:
            app: go-simple-chat
        spec:
          containers:
            - name: go-simple-chat
              image: btoll/go-simple-chat-debug:latest
              imagePullPolicy: Always
              ports:
                - containerPort: 9999
                  name: chat
                - containerPort: 2345
                  name: debug
              resources:
                requests:
                  memory: 64Mi
                  cpu: 250m
                limits:
                  memory: 128Mi
                  cpu: 500m
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: go-simple-chat
      namespace: default
    spec:
      type: ClusterIP
      selector:
        app: go-simple-chat
      ports:
        - port: 9999
          targetPort: 9999
          protocol: TCP
          name: chat
        - port: 2345
          targetPort: 2345
          protocol: TCP
          name: debug

  The most important bits about this deployment are the defined ports. Port 9999 is for the chat program, and port 2345 is for Delve. They will both be port forwarded to the local network.

- The debug container image.

  The original Dockerfile for the chat program needs to have debug information added to it. I’m showing it here to compare it with what needs to be done to prepare it for our debugging session.

  Dockerfile

    FROM golang:1.25.11-trixie AS builder
    WORKDIR /app
    COPY go.mod main.go client.go ./
    RUN CGO_ENABLED=0 GOOS=linux go build -o go-simple-chat

    FROM scratch
    COPY --from=builder /app/go-simple-chat /
    EXPOSE 9999
    ENTRYPOINT ["/go-simple-chat"]

  The debug Dockerfile installs the Delve debugger and compiles go-simple-chat with optimizations and inlining disabled (-gcflags="all=-N -l") and the symbol table intact.

  Dockerfile.debug

    FROM golang:1.25.11-trixie AS builder
    WORKDIR /app
    COPY go.mod main.go client.go ./
    RUN go install github.com/go-delve/delve/cmd/dlv@latest && \
        CGO_ENABLED=0 GOOS=linux go build -gcflags="all=-N -l" -o go-simple-chat

    FROM golang:1.25.11-trixie
    WORKDIR /app
    COPY --from=builder /go/bin/dlv /dlv
    COPY --from=builder /app/main.go /app/client.go ./
    COPY --from=builder /app/go-simple-chat /
    EXPOSE 9999 2345
    ENTRYPOINT ["/dlv", "exec", "/go-simple-chat", "--headless", "--listen=0.0.0.0:2345", "--api-version=2"]

  In addition, it will start Delve (rather than the go-simple-chat program) in headless mode, listening on port 2345 on all interfaces. Also, note that the source files (main.go, client.go) are copied to the final stage so that the Delve debugger can source them.

- Delve debugger.
Let’s download and start minikube:
$ curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
$ sudo dpkg -i minikube_latest_amd64.deb
$ MINIKUBE_ROOTLESS=false minikube start --driver=docker
Apply the deployment:
$ kubectl apply -f deployment.yaml
deployment.apps/go-simple-chat created
service/go-simple-chat created
$ kubectl get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/go-simple-chat-76f8dc7585-9p4zl   1/1     Running   0          2m38s
pod/go-simple-chat-76f8dc7585-fw9ps   1/1     Running   0          2m38s
pod/go-simple-chat-76f8dc7585-hjwf5   1/1     Running   0          2m38s

NAME                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
service/go-simple-chat   ClusterIP   10.97.190.61   <none>        9999/TCP,2345/TCP   2m38s
service/kubernetes       ClusterIP   10.96.0.1      <none>        443/TCP             43h

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/go-simple-chat   3/3     3            3           2m38s

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/go-simple-chat-76f8dc7585   3         3         3       2m38s
And, here are the open ports that the service is listening on:
$ kubectl get svc go-simple-chat -ojsonpath="{.spec.ports[*].port}"
9999 2345
To debug, first enable port forwarding and bind the remote port 2345 to the local network:
$ kubectl port-forward svc/go-simple-chat 2345:2345
Forwarding from 127.0.0.1:2345 -> 2345
Forwarding from [::1]:2345 -> 2345
Then, in another terminal, connect to localhost on that port. You should be dropped into a dlv console, at which point you should map the location of the source files main.go and client.go in the remote container to their location on your local machine so Delve has access to them (unless they’re in the same absolute path in both locations). Then, set your breakpoint. Here, it’s in the main.handleNewConnection function, but do what you want, I’m not the boss of you:
$ dlv connect 127.0.0.1:2345
Type 'help' for list of commands.
(dlv) config substitute-path /app /home/btoll/projects/go-simple-chat
(dlv) b main.handleNewConnection
Breakpoint 1 set at 0x7aa516 for main.handleNewConnection() ./main.go:45
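If the path mapping feels like magic, it helps to think of it as a prefix rewrite: any source path under the first directory gets re-rooted under the second. The sketch below illustrates that idea only. It is not Delve's implementation, and substitutePath is a name I made up; the two paths are the ones used in the session above.

```go
package main

import (
	"fmt"
	"strings"
)

// substitutePath re-roots path under to when it lives under the directory
// from; any other path is returned unchanged. This is an illustration of
// the idea, not how Delve itself is written.
func substitutePath(path, from, to string) string {
	from = strings.TrimSuffix(from, "/")
	to = strings.TrimSuffix(to, "/")
	if path == from {
		return to
	}
	// Match on from+"/" so that /application is not mistaken for /app.
	if strings.HasPrefix(path, from+"/") {
		return to + path[len(from):]
	}
	return path
}

func main() {
	remote := "/app"
	local := "/home/btoll/projects/go-simple-chat"
	for _, p := range []string{"/app/main.go", "/app/client.go", "/usr/local/go/src/net/net.go"} {
		fmt.Println(p, "->", substitutePath(p, remote, local))
	}
}
```

The files that were compiled in /app inside the container get looked up in the local checkout, and everything else (the standard library, say) is left alone.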
Delve has created go-simple-chat as a child process, and the stage has been set for any number of clients to be connected to the program and then controlled by Delve. Here we are port forwarding 9999 to localhost and then connecting a client. This can be done in two different terminals, or in just one by sending the port forwarding command to the background:
$ kubectl port-forward svc/go-simple-chat 9999:9999
Forwarding from 127.0.0.1:9999 -> 9999
Forwarding from [::1]:9999 -> 9999
Handling connection for 9999
$ nc localhost 9999
Here, we’re opening another terminal and connecting using netcat, but you can just as easily use another client like telnet. The program doesn’t have its own programmed client.
Depending where you set your breakpoint, the program may block until you advance it, as it did just above when using netcat to connect to the server. In the other terminal pane (if you’re using a terminal multiplexer like screen or tmux, which, of course you are), you’ll see that the breakpoint has been hit and is waiting for your engagement:
$ dlv connect 127.0.0.1:2345
Type 'help' for list of commands.
(dlv) config substitute-path /app /home/btoll/projects/go-simple-chat
(dlv) b main.handleNewConnection
Breakpoint 1 set at 0x7aa516 for main.handleNewConnection() ./main.go:45
(dlv) c
> [Breakpoint 1] main.handleNewConnection() ./main.go:45 (hits goroutine(4):1 total:1) (PC: 0x7aa516)
40: broadcast: make(chan ChatMessage, broadcastBuffer),
41: mu: &sync.Mutex{},
42: }
43: }
44:
=> 45: func handleNewConnection(ctx context.Context, conn net.Conn) {
46: _, err := conn.Write([]byte("What's your name?: "))
47: if err != nil {
48: fmt.Printf("err=%+v\n", err)
49: conn.Close()
50: return
(dlv)
To advance, just enter c (or continue) and you’ll be prompted for your name:
What's your name?: Ben
Welcome to the the simple chat server, Ben!
Now, go play and have fun.
Summary
This article was inspired by my time at Red Hat when I was remote debugging an operator in an OpenShift cluster. It allowed us visibility into a bug that we were fixing that was caused when a library dependency was upgraded and broke one of the APIs.
There was another API in the library that seemed to fix the issue, but the library maintainers had added a comment that it had been deprecated. Also, the word “seemed” was doing a lot of heavy lifting there, since at the time we didn’t have the access needed to the operator in the cluster and the fix was untested (but appeared to work). The position of a senior member of the team was to check the fix into version control anyway so we could move on.
I was adamant that we needed to properly test and debug the issue and not use the deprecated API but another one that the maintainers recommended. However, the recommended fix used two new structs that the deprecated API did not use, and this was further reason to dig in and do the hard work that was finding out how to properly test the running code in a running cluster.
I patiently explained to the team why it was not a good idea to proceed without first gaining visibility and access to the operator on the cluster and then thoroughly testing it with unit tests and why it was a very bad idea to ever check in untested code. You’d think this conversation wouldn’t need to happen, but I assure you it did. More than once.
At the end of the day, it took an afternoon to get access to the pod and test the new API, including walking through the code and understanding why it was needed. It eliminated all the guesswork and anxiety that naturally comes from doing something half-assed when the root cause hasn’t been properly diagnosed and understood, and it made us all feel warm and safe. Yay.
Lastly, and without going into detail, I would be remiss if I didn’t add that any tension that arose from this incident was between those of us that find Agile horrid and repugnant and those of us that think that Agile is the bee’s knees.
The end.