Ah yes—“It works fine on my machine!” Perhaps the most famous saying in our industry. Even with the advent of containers that provide consistent environments across the SDLC, we still hear developers fall back to this claim when a defect is found. But in the end, if the code doesn't work in test or production, it doesn't work—even if it works locally. So as a developer, being able to deep dive into your containerized application to fix the problem—regardless of the environment—is a critical skill we must all learn. Let's look at a sample app that works locally, but fails when deployed, and how you might use the scientific method (you remember that from school, right?) to debug these failures in containerized apps. Our Example For our , we'll use a simple Flask application—deployed in a Kubernetes container—that displays the famous people that have used our app. It works flawlessly on our local system! example Now let's deploy this to our service and see if it’ll behave the same (hint: it won't). Heroku heroku container:push worker -a heroku-troubleshooting heroku container:release worker -a heroku-troubleshooting Drats! Foiled by the production environment powers that be. Now we can start digging into what happened and see what tools we have at our disposal. Our Approach When your app fails, it's helpful to take an unbiased and objective approach to triage the issue and understand the faults. One such classical approach is . Harnessed through hundreds of years of rigorous repetition and review, I've found that this method works great for modern debugging. the Scientific Method Too often developers and DevOps engineers take a "let's see what this does" scattered approach to debugging. We grab the first thing that comes to mind that might be wrong, change our code/config around, deploy, and see if that worked. And if that doesn't work, we just keep trying, sometimes forgetting what we've already tried, sometimes spending hours or even days going down a path that if we just stopped and thought about it for a minute, clearly is the wrong path. But if we use a more reasoned, structured approach - like the scientific method - our debugging process could be more productive, faster, and in the end, better. The scientific method, at its most basic, involves forming a hypothesis, conducting experiments, observing the results, and then refining your hypothesis. The scientific method as I apply it to debugging containerized apps looks like this: Ask "What should be happening?" Determine reasons that might explain why it's not happening Design a test to validate/invalidate those claims by tuning a config or using a tool Observe the outcome Draw a conclusion Repeat until the outcome is what should be happening The Scientific Method in Action Now let's see that approach in action. 1. Ask "What should be happening?" First, we need to understand what our application is trying to do. To start investigating, we need to understand the details of the page rendering. Let’s see how the Dockerfile describes our service: FROM python: WORKDIR /usr/src/app COPY requirements.txt ./ RUN pip install --no-cache-dir -r requirements.txt RUN echo >> healthy && \ cat /tmp/healthy ENV FLASK_ENV=development ENV FLASK_APP=myapp.py COPY myapp.py . CMD [ , ] 3 "healthy" /tmp/ "python3" "myapp.py" Pretty simple. We have a Python3 application, it installs some requirements, and then starts myapp.py. Let’s take a quick look at what our Python application is doing and then we can decide how to best proceed. flask Flask, Response os psycopg2 app = Flask(__name__) query = ( ) host = os.getenv( ) password = os.getenv( ) user = os.getenv( ) dbname = os.getenv( ) print(host) print(user) print(dbname) def callpsql(): conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host, port= ) cur = conn.cursor() cur.execute(query) results = cur.fetchall() cur.close() conn.close() results @app.route( ) def index(): response = callpsql() print( ) results = +os.getenv( )+ row response: results = results+ +str(row[ ])+ +str(row[ ]) results @app.route( ) def health(): f = open( , ) print(f) health = f.readline().rstrip() resp = Response(health) health != : resp.status= resp.headers[ ] = health resp __name__ == : port = app.run(debug=True, host= , port=port) from import import import "select * from actor order by random() limit 5;" "host" "PSQL_PASSWORD" "PSQL_USERNAME" "PSQL_DBNAME" "5432" return '/' "Getting some juicy logs" "I've sold to these famous people from server " "DYNO" ": <br>\n" for in "<br>\n" 1 "\t" 2 return '/health' "/tmp/healthy" "r" if "healthy" 500 'health-check' return if "__main__" 5000 '0.0.0.0' Another pretty simple application. This takes a few environment variables for authentication and executes a SQL query against a Postgres database when the root page is called. 2. Determine reasons that might explain why it's not happening Now that we know what the application is trying to do, we move to the next step—explain why it's not happening in our new environment. Here we can look at the Heroku logs --tail command that’s suggested on the error page. This should give us a good idea of what our service is doing at any point in time and also help us form a hypothesis as to what might be the problem. $ heroku logs --tail -a heroku-troubleshooting ... T22: : + : heroku[router]: at=error code=H14 desc= method=GET path= host=heroku-troubleshooting.herokuapp.com request_id= b76-b372 b-e8611c020fd5 fwd= dyno= connect= service= status= bytes= protocol=https … 2020 -08 -22 45 34.805282 00 00 "No web processes running" "/" 09292 -49e9 -858 "xx.xx.xx.xx" 503 We see the error “No web processes running” appear every time we load the page. It seems as though Heroku doesn’t know that we’ve started Flask. Great! We have our hypothesis - "Flask probably isn't running in Heroku's environment." 3. Design a test to validate/invalidate those claims by tuning a config or using a tool Now let's move to the next steps: create a test to validate what we think is happening, observe the outcome, and draw a conclusion. So let's test our hypothesis - we'll hop onto our dyno and validate that the Flask service is (or is not) actually running. Our Dockerfile specifies python should be a running, listening service. heroku ps:exec -a heroku-troubleshooting Establishing credentials... error ▸ Could not connect to dyno! ▸ Check the dyno is running if with `heroku ps' 4. Observe the outcome We released our container, so it should be running. However, the logs from our test have shown us that we couldn’t connect to the dyno that should be running the code. Executing the suggested command may give us another clue. $ heroku ps -a heroku-troubleshooting No dynos on ⬢ heroku-troubleshooting 5. Draw a conclusion Well, this is embarrassing. The scientific method conclusion? We were wrong in our initial assertion that our released code left us with a running dyno to serve the code. We definitely pushed and released our image but we didn’t add any dynos to our application. It appears we have validated that the application is not running and that our hypothesis was incorrect. 6. Repeat until the outcome is what should be happening So we move on to the next step, which is repeat the process—and we'll continue to repeat this process throughout this article until our problem is solved. So let's form a new hypothesis and test it. Our new hypothesis, based on the last test, is that with the dyno running, we should be able to reach our Flask application from the outside world. We'll scale a dyno up to start the application and check the logs again to test that assertion. $ heroku dyno:scale -a heroku-troubleshooting worker= Scaling dynos... done, now running worker at :Standard X $ heroku logs --tail -a heroku-troubleshooting T23: : + : app[api]: Scaled to worker@ :Standard X by user T23: : + : heroku[worker ]: Starting process command T23: : + : heroku[worker ]: State changed starting to up T23: : + : app[worker ]: None T23: : + : app[worker ]: None T23: : + : app[worker ]: None T23: : + : app[worker ]: * Serving Flask app (lazy loading) T23: : + : app[worker ]: * Environment: development T23: : + : app[worker ]: * Debug mode: on T23: : + : app[worker ]: * Running on http: T23: : + : app[worker ]: * Restarting stat T23: : + : app[worker ]: * Debugger is active! T23: : + : app[worker ]: * Debugger PIN: 1 1 -1 2020 -08 -22 29 05.207217 00 00 1 -1 2020 -08 -22 29 18.202160 00 00 .1 with `python3 myapp.py` 2020 -08 -22 29 18.760500 00 00 .1 from 2020 -08 -22 29 21.559386 00 00 .1 2020 -08 -22 29 21.559547 00 00 .1 2020 -08 -22 29 21.559548 00 00 .1 2020 -08 -22 29 21.559549 00 00 .1 "myapp" 2020 -08 -22 29 21.559556 00 00 .1 2020 -08 -22 29 21.559556 00 00 .1 2020 -08 -22 29 21.636027 00 00 .1 //0.0.0.0:5000/ (Press CTRL+C to quit) 2020 -08 -22 29 21.637965 00 00 .1 with 2020 -08 -22 29 21.897821 00 00 .1 2020 -08 -22 29 21.929543 00 00 .1 485 -744 -571 There we go. Now the application has started! It's time to test the site again to observe if it’s working. Application error Our conclusion now, is it’s definitely not. We should go over our commands again and see if the status has changed at all so that we can create a new hypothesis. $ heroku logs -a heroku-troubleshooting T23: : + : heroku[router]: at=error code=H14 desc= method=GET path= host=heroku-troubleshooting.herokuapp.com request_id= d3cbafa-cb31 -bfc6 ee18ce72eb0 fwd= dyno= connect= service= status= bytes= protocol=https $ heroku ps -a heroku-troubleshooting === worker (Standard X): python3 myapp.py ( ) worker : up / / : : (~ m ago) 2020 -08 -22 32 12.098568 00 00 "No web processes running" "/" 4 -4452 -7 "xx.xx.xx.xx" 503 -1 1 .1 2020 08 22 16 31 59 -0700 3 It looks like we’re running a service but we can’t get any more data using only the logs. Hypothetically we can get into the service and make sure it’s working. $ heroku ps:exec -d worker -a heroku-troubleshooting Establishing credentials... error ▸ Could not connect to dyno! ▸ Check the dyno is running .1 if with `heroku ps' Here we can observe that even with the dyno running, we can't connect to it. Looking back at our Dockerfile and some documentation, we can conclude that there are a needed for the Heroku environment to support exec into the deployed containers. We should add a few utility commands we might need while we’re in here to test future hypotheses while we test if this new configuration will allow us to connect. few requisites ... RUN apt-get update && \ apt-get install -y \ curl \ openssh-server \ net-tools \ dnsutils \ iproute2 RUN mkdir -p /app/.profile.d && \ touch /app/.profile.d/heroku-exec.sh && \ echo >> .profile.d/heroku-exec.sh && \ chmod +x /app/.profile.d/heroku-exec.sh ... '[ -z "$SSH_CLIENT" ] && source <(curl --fail --retry 3 -sSL "$HEROKU_EXEC_URL")' /app/ One more push and release later and we've validated our conclusion was correct and we can finally connect: $ heroku ps:exec -d worker -a heroku-troubleshooting Establishing credentials... done Connecting to worker on ⬢ heroku-troubleshooting... The programs included the Debian GNU/Linux system are free software; the exact distribution terms each program are described the individual files /usr/share/doc .1 .1 with for in in /*/copyright. Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law. ~ $ Now that we’re in the application, we can check our hypothesis that the port is listening as it should be and that our application is responsive from within the container itself. We can utilize “net-tools” netstat command as a test to show us a list of ports that are open on our container and listening for calls to open a socket. ~ $ netstat -apn | grep | grep LIST (Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.) tcp : :* LISTEN /python3 5000 0 0 0.0 .0 .0 5000 0.0 .0 .0 3 Our analysis shows that the port is open as expected and our hypothesis is correct. Because nothing has changed, we must design a new test. “Curl” can act as our command-line internet browser to show us a simplified view into the port we now verified is open and listening properly. ~ $ curl -I localhost: HTTP/ INTERNAL SERVER ERROR Content-Type: text/html; charset=utf X-XSS-Protection: Server: Werkzeug/ Python/ : Sun, Aug : : GMT ~ $ curl localhost: ... psycopg2.OperationalError: could not connect to server: No such file or directory Is the server running locally and accepting connections on Unix domain socket ? ... 5000 1.0 500 -8 0 1.0 .1 3.8 .5 Date 23 2020 00 27 14 5000 "/var/run/postgresql/.s.PGSQL.5432" Ah-ha! Our analysis of this error proves that the calls are failing because we’re getting a 500 error in our application. Our new hypothesis is that this is surely at least one reason why Heroku won’t send traffic to our application. If we look back at our Python code, we can see that it requires some environment variables. In our local host we have them exported as part of a development virtual environment. We can now conclude these need to be added to our application in Heroku as well. Our new hypothesis becomes "The service should work as expected in both the local and production environments." To test this hypothesis and pick up the new change, we'll need to restart our worker which should let the application pick up the new environment variables. $ heroku ps:restart worker -a heroku-troubleshooting Restarting worker dyno on ⬢ heroku-troubleshooting... done $ heroku ps:exec -d worker -a heroku-troubleshooting Establishing credentials... done Connecting to worker on ⬢ heroku-troubleshooting... The programs included the Debian GNU/Linux system are free software; the exact distribution terms each program are described the individual files /usr/share/doc .1 .1 .1 .1 with for in in /*/copyright. Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law. ~ $ curl -I localhost:5000 HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 148 Server: Werkzeug/1.0.1 Python/3.8.5 Date: Sun, 23 Aug 2020 00:50:35 GMT ~ $ curl localhost:5000 I've sold to these famous people from server worker.1: <br> <br> Julianne Dench<br> Sylvester Dern<br> Greta Keitel<br> Dustin Tautou<br> Penelope Monroe We’ve now validated that the local container is working as we expect after re-running our previous tests. Still, our logs and browser conclude that the app is still not functioning as expected. Logs from our internal curls T00: : + : app[worker ]: - - [ /Aug/ : : ] - T00: : + : app[worker ]: - - [ /Aug/ : : ] - 2020 -08 -23 50 35.931678 00 00 .1 127.0 .0 .1 23 2020 00 50 35 "[37mHEAD / HTTP/1.1[0m" 200 2020 -08 -23 50 41.807629 00 00 .1 127.0 .0 .1 23 2020 00 50 41 "[37mGET / HTTP/1.1[0m" 200 Logs from our external browser T00: : + : heroku[router]: at=error code=H14 desc= method=GET path= host=heroku-troubleshooting.herokuapp.com request_id=a9421f09 f2 -f2f63ffe5123 fwd= dyno= connect= service= status= bytes= protocol=https 2020 -08 -23 50 47.151656 00 00 "No web processes running" "/" -94 -4822 -9608 "68.251.62.251" 503 Application error With the application working in the container, it seems that everything should be working fine, hypothetically. Utilizing our prior re-run of tests and taking a closer analysis of the error “No web processes running” gives us another hint. There are two types of dynos, and we’ve deployed this as a “worker” dyno. Looking at the , we can conclude that perhaps we’ve deployed to the wrong resource type that does not get an ingress. Hypothetically Heroku should provide a dynamic external port if we configure the dyno as "web" instead of "worker". Time to test redeploying as a “web” dyno. documentation heroku ps:scale worker= -a heroku-troubleshooting heroku container:push web -a heroku-troubleshooting heroku container:release web -a heroku-troubleshooting heroku ps:scale web= -a heroku-troubleshooting 0 1 Browsing to the site still gives the same conclusion as our initial tests, but now our observed logs are giving us our next thread to investigate. T01: : + : app[web ]: * Serving Flask app (lazy loading) T01: : + : app[web ]: * Environment: development T01: : + : app[web ]: * Debug mode: on T01: : + : app[web ]: * Running on http: T01: : + : app[web ]: * Restarting stat T01: : + : app[web ]: * Debugger is active! T01: : + : app[web ]: * Debugger PIN: T01: : + : heroku[web ]: R10 (Boot timeout) -> Web process failed to bind to $PORT within seconds launch T01: : + : heroku[web ]: Stopping process SIGKILL T01: : + : heroku[web ]: Process exited status T01: : + : heroku[web ]: State changed starting to crashed T01: : + : heroku[web ]: State changed crashed to starting “Web process failed to bind to $PORT” 2020 -08 -23 03 12.890612 00 00 .1 "myapp" 2020 -08 -23 03 12.890619 00 00 .1 2020 -08 -23 03 12.890620 00 00 .1 2020 -08 -23 03 12.905580 00 00 .1 //0.0.0.0:5000/ (Press CTRL+C to quit) 2020 -08 -23 03 12.906860 00 00 .1 with 2020 -08 -23 03 13.661451 00 00 .1 2020 -08 -23 03 13.662395 00 00 .1 196 -723 -097 2020 -08 -23 04 07.990856 00 00 .1 Error 60 of 2020 -08 -23 04 08.016515 00 00 .1 with 2020 -08 -23 04 08.224860 00 00 .1 with 137 2020 -08 -23 04 08.290383 00 00 .1 from 2020 -08 -23 04 08.293139 00 00 .1 from We can conclude that we need to bind our process to a variable rather than our default Flask port. This is simple enough to fix in our Python application. If we have the correct variable mapped in our production environment, and our application is utilizing that as its port configuration, then we should have a fully functioning site. To test this we can pass in the environment variable from the host and use it in our app.run argument. myapp.py __name__ == : port = int(os.getenv( , )) app.run(debug=True, host= , port=port) if "__main__" "PORT" 5000 '0.0.0.0' And there you have it! Our analysis of the test shows that the app works both locally and when deployed. One final validation to ensure that our changes continue to work in our development environment as well. Putting it Together In this workflow, we used the scientific method to investigate an application that was failing to render in production. Debugging a managed infrastructure is a daunting task, but only because of the sum of its parts. By breaking down each layer of the product, reviewing how it behaves with standardized tooling, taking clues from the outputs of the logs and application outputs, and reviewing various product documentation, we can break down these complex structures to meaningful and simple conclusions that will resurrect any service. If you want to further analyze the app that we deployed, you can see it . Next time you find yourself investigating a difficult problem, remember to take each component one layer at a time and apply methodical, scientific, approaches to the process. here By doing this you can remove much of the personal bias that would otherwise potentially derail you into a wild goose chase of StackOverflow articles and random community forums. Also published at https://dzone.com/articles/using-the-scientific-method-to-debug-containerized
Share Your Thoughts