Many IT companies have adopted the practice of on-call duty. An engineer goes on duty, and their duty lasts for a shift. Usually, a shift lasts one day or one week; longer or shorter periods exist, but they are rare. The responsibilities of the on-duty engineer vary from company to company, but there are some common points.
In this article, we'll talk about how to be well prepared for on-call duty as an engineer.
If incidents or fires occur on your shift in the production environment, it does not mean you are a terrible engineer. To be a good on-duty engineer, you need to solve problems, when they occur, as quickly and cost-effectively as possible. A well-prepared engineer understands which situations they can handle themselves and when it's worth escalating further. They should know when and whom to call for help, and, as far as possible, they must not allow the same problems to be repeated.
To prepare for duty, it is helpful to draw on your own experience and that of your team.
It's a good idea to start by looking at what documentation already exists for handling incidents. Is there documentation at the company or team level? It is essential to know where it is and be familiar with it. If it doesn't exist, try to initiate the process of creating it.
Then it is worth checking what incidents have already happened and how they are recorded in the knowledge base. Are there postmortems, and were follow-up tasks created to fix the underlying problems? Are those tasks actually being done? If tasks are created but never implemented, that is a reason to raise the issue at a meeting with your manager.
Are the responsibilities of the on-duty engineer documented? What are they responsible for, and what are they not? If there is no such documentation or shared understanding, it would be helpful for the whole team to create one.
The toolkit of the on-duty engineer is specific and somewhat different from the developer's everyday toolkit. The main tasks that arise during firefighting are those that cannot be solved with existing queries.
Let's look at these tools in more detail:
It will be very useful to have a set of ready-made scripts at hand when a fire happens. They should be as simple and straightforward as possible. Yes, you could probably write them very quickly on the spot, but when time is limited, it's great to have a ready-made set that lets you think only about the incident and not about how to implement the same boilerplate again. The scripts below only illustrate the potential structure of such scripts. Of course, your code should be tested before the incident and be as simple and straightforward as possible. It is helpful to have the following scripts:
A script that generates a file from the data provided. The data can be obtained from other commands or by making a separate request. An example of such a script in Python is shown below.
import csv

def modify(filename):
    tmpFile = "tmp.csv"
    # Read the file with data and create the output file
    with open(filename, "r") as file, open(tmpFile, "w", newline="") as outFile:
        # Create a reader for the initial file
        reader = csv.reader(file, delimiter=',')
        # Create a writer for the output file
        writer = csv.writer(outFile, delimiter=',')
        # Read the header line
        header = next(reader)
        # Write the header line
        writer.writerow(header)
        # Process the initial file line by line
        for row in reader:
            colValues = []
            # Process each column of each line
            for col in row:
                # For example, transform all columns to lowercase
                colValues.append(col.lower())
            # Write the modified line to the output file
            writer.writerow(colValues)

filename = 'sample_data.csv'
modify(filename)
A script that calls the required endpoints with the specified parallelism. It doesn't have to be anything complicated. Below is an example of simple JavaScript code that generates an sh file with the specified parallelism; an example of the generated file is shown after the code. There is no result handling here, but it's not always needed, and you can extend your toolkit with a result-handling version if required. In this example, the script reads and writes all the data at once, but you can create streaming versions for huge files.
const fs = require('fs');

const initialFilePath = 'sample_data.csv';
const outputFilePath = 'sample_script.sh';
const amountOfParallelRequests = 5; // Remember the throughput and bandwidth limits of your services
const delimiterForCSV = ',';

// Read the initial file and split it by lines
// You could transform it into an object if that's relevant to your situation
let initialFile = fs.readFileSync(initialFilePath).toString().split('\n');

// Prepare the boilerplate for the sh script
let outputString = '#!/bin/bash\n\n';

// Write data with parallel execution
// Skip the header line of the CSV
// The approach for parallel requests comes from https://serverfault.com/questions/456490/execute-curl-requests-in-parallel-in-bash
// and you could implement your own version instead
for (let i = 1; i < initialFile.length; i++) {
    let line = initialFile[i];
    if (!line) {
        continue;
    }
    let processedLine = line.split(delimiterForCSV); // We don't handle malformed lines here
    // Let's assume the value needed for the request lies in the second column
    let desiredValue = processedLine[1];
    if (desiredValue === undefined) {
        console.error('We have a problem with this line: ' + line);
        continue;
    }
    outputString += `curl -s -o foo http://example.com/file${desiredValue} && echo "done with ${desiredValue}" &\n`;
    if (i % amountOfParallelRequests === 0 || i === initialFile.length - 1) {
        outputString += '\nwait\n\n';
    }
}

fs.writeFileSync(outputFilePath, outputString);
// Indicate success
console.log('Success');
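To give an idea of the result, assuming a hypothetical sample_data.csv with a header row and five data rows whose second column holds the IDs 101 to 105, the generated sample_script.sh would look roughly like this:

#!/bin/bash

curl -s -o foo http://example.com/file101 && echo "done with 101" &
curl -s -o foo http://example.com/file102 && echo "done with 102" &
curl -s -o foo http://example.com/file103 && echo "done with 103" &
curl -s -o foo http://example.com/file104 && echo "done with 104" &
curl -s -o foo http://example.com/file105 && echo "done with 105" &

wait

You can then run it with bash sample_script.sh: the curl commands in a batch run in the background, and wait blocks until the batch finishes before the script continues.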
A script that gets something from a database or service and puts the converted result back, or perhaps calls another endpoint with it. I suggest implementing it yourself, not forgetting about authorization and appropriate usage scenarios; a minimal sketch is shown below.
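As a starting point, here is a minimal sketch in Python, assuming a hypothetical service with /items and /items/{id} endpoints and token authorization; the URL, headers, and transformation are placeholders to replace with your own.

import requests

# Hypothetical values: replace with your real service URL and credentials
BASE_URL = "https://example.com/api"
HEADERS = {"Authorization": "Bearer <your-token>"}

def fetch_items():
    # Get the records we want to fix from the service
    response = requests.get(f"{BASE_URL}/items", headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()

def transform(item):
    # Example transformation: normalize a field; adapt it to your incident
    item["status"] = item["status"].lower()
    return item

def push_back(item):
    # Write the converted result back (or call another endpoint instead)
    response = requests.put(f"{BASE_URL}/items/{item['id']}", json=item, headers=HEADERS, timeout=30)
    response.raise_for_status()

if __name__ == "__main__":
    for item in fetch_items():
        push_back(transform(item))
        print(f"updated {item['id']}")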
In addition to tools and experience, a certain amount of knowledge is helpful when you go on duty.
Know where the logs and metrics of your services go and how to get to them. Do you know how to use these tools? When you get a call at night from the on-call service telling you that your service is down, it's in your best interest to discover quickly what's going wrong. To do that, you need to know precisely where your metrics and logs are stored and what to look at first.
How do you postpone alerts? After analyzing an incident, it often turns out that it can wait until morning, so it is good to understand how to postpone (snooze) the alert. Not close it, because for some systems you will simply be notified again, but precisely postpone it. Remember to deal with the situation and the alert as soon as your regular working day starts. A sketch of postponing an alert through an API is shown below.
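The exact mechanism depends on your alerting system, so treat this as a sketch against a hypothetical incident-management API that exposes a snooze endpoint; the URL, payload, and token are assumptions to adapt to your tooling.

import requests

# Hypothetical alerting API: adjust the URL, payload, and auth to your system
ALERTING_API = "https://alerting.example.com/api/incidents"
HEADERS = {"Authorization": "Bearer <your-token>"}

def snooze_alert(incident_id, hours):
    # Postpone the alert instead of closing it, so it comes back if we forget
    payload = {"snooze_duration_seconds": hours * 3600}
    response = requests.post(f"{ALERTING_API}/{incident_id}/snooze", json=payload, headers=HEADERS, timeout=10)
    response.raise_for_status()
    print(f"Incident {incident_id} snoozed for {hours} hours")

if __name__ == "__main__":
    snooze_alert("INC-12345", hours=8)  # Hypothetical incident ID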
Where are the contacts stored, and how do you get in touch with colleagues or members of other teams? There must be a clear tool, or at least clear knowledge in your head, of who understands or should know about the situation when it happens. The experts can help you solve the incident, and the stakeholders need to know that something is going wrong. You must have access to their contacts, ideally their phone numbers, because many people turn off notifications from office chats outside of business hours.
How do you get access to production or the database, and how do you raise your access level if access levels exist? If you don't have access, you need to know who to go to or what to do to get the access you need.
How do you get code into production quickly? Sometimes problems require a quick change to the service code in production. In general, this is rightly considered bad practice, but an emergency is often an exception. Sometimes you don't want to wait for long E2E tests and need to get the code into the production environment quickly. You should understand how to do this.
What data is stored in the database? Are there any diagrams of data flow within the product and between services? If you need to interact with the database, it's good to know how the data is organized in a particular service, where it comes from, and who uses it. This will allow you to deal with problems, if there are any, sooner. If the schema is unfamiliar, it also helps to know how to inspect it quickly; a sketch is shown below.
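As an illustration, here is a minimal sketch of inspecting a schema, assuming a PostgreSQL database and the psycopg2 driver; the connection string is a placeholder.

import psycopg2

# Hypothetical connection string: replace with your real host and credentials
conn = psycopg2.connect("host=db.example.com dbname=orders user=readonly password=<secret>")

with conn, conn.cursor() as cur:
    # List tables and columns of the public schema to see how the data is organized
    cur.execute(
        """
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
        """
    )
    for table_name, column_name, data_type in cur.fetchall():
        print(f"{table_name}.{column_name}: {data_type}")

conn.close()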
Even in companies and teams with an excellent engineering culture, accidents and fires happen during on-duty shifts. The team must make every effort to improve current processes and products to avoid them, but each engineer must also be prepared for an accident to happen and for having to deal with it urgently. To do this, it is worth using all the accumulated personal and team experience, knowing your organization and services, and being confident in your toolkit.