A few weeks ago, we at Kurio were developing a new feature: a recommended/targeted push notification. It's different from our current push notification system, in which the same push notification is sent to all users based on the data we have. With this feature, every user may get a different push notification.
Our old system, drawn as a picture, looks like this:
This is how the normal push notification works. We choose an article to push based on the data we have about that article, then initiate a push notification from the dashboard, which triggers the "push management" service. Every user gets the same push notification.
Then we decided to improve it by sending a targeted push, so that each user may get a different push notification based on what they previously read.
The new process looks like this:
Kurio Targeted Push Notification
This new system looks more complex, because it uses an external service to provide the user_id and article_id based on what each user has read. Since it takes some time to process all the users for a targeted push notification, and the data can be quite large, we decided to use Google Cloud PubSub. By using PubSub, we can start sending a notification right after each user is processed.
The challenge we had when we decided on targeted push notifications was integrating with the external targeting system. The targeting system generates the tuples of user_id and article_id on demand, and it might take some time to do its processing. We also needed to define how we want to consume that data, as it might be large.
With those challenges in mind, we decided to use Google Cloud PubSub, so that the external targeting system can take its time and simply publish the tuple of user_id and article_id whenever it's ready. Then we only need a subscriber listening to that topic.
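As a rough illustration, a minimal subscriber could look like the sketch below. The project ID, subscription name, and message payload shape are assumptions for the example, not our actual production setup.

package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
)

// targetedPush is an assumed payload shape; the real message format
// published by the targeting system may differ.
type targetedPush struct {
	UserID    int64 `json:"user_id"`
	ArticleID int64 `json:"article_id"`
}

func main() {
	ctx := context.Background()

	// "kurio-project" and "targeted-push-sub" are placeholder names.
	client, err := pubsub.NewClient(ctx, "kurio-project")
	if err != nil {
		log.Fatal(err)
	}

	sub := client.Subscription("targeted-push-sub")

	// Receive blocks and invokes the callback for each message,
	// possibly from multiple goroutines.
	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		var p targetedPush
		if err := json.Unmarshal(msg.Data, &p); err != nil {
			msg.Nack()
			return
		}
		// Build and send the push notification for p.UserID / p.ArticleID here.
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}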
Our worker then gets the data it needs to build the notification. In this case, that's the article's details (title, thumbnail, etc.) and the user's device info (push token).
Thanks to our microservice architecture, we can get the article's details from the service that manages that data. As for the user's device info, we already have it in our own database.
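To make that flow concrete, here is a hypothetical sketch of the data gathering. The types, interfaces, and field names below are illustrative only; the real Kurio services are not shown in this post.

package worker

import "errors"

// Illustrative types standing in for the real article service response,
// device record, and push payload.
type Article struct {
	Title     string
	Thumbnail string
}

type Device struct {
	Token string
}

type Notification struct {
	Title     string
	Thumbnail string
	Token     string
}

type ArticleService interface {
	GetArticle(articleID int64) (*Article, error)
}

type DeviceRepository interface {
	GetDeviceByUserID(userID int64) *Device
}

type Worker struct {
	articles ArticleService
	devices  DeviceRepository
}

// buildNotification gathers the article detail from the article microservice
// and the push token from our own database, then combines them.
func (w *Worker) buildNotification(userID, articleID int64) (*Notification, error) {
	article, err := w.articles.GetArticle(articleID)
	if err != nil {
		return nil, err
	}

	device := w.devices.GetDeviceByUserID(userID)
	if device == nil {
		return nil, errors.New("no active device for user")
	}

	return &Notification{
		Title:     article.Title,
		Thumbnail: article.Thumbnail,
		Token:     device.Token,
	}, nil
}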
After we finished the code, we tried the system in the production environment. For a few hundred users the system was still running well. After expanding the number of users, disaster happened.
We got a Too Many Connections error. This happened because we spawn a goroutine for each message. As the number of goroutines increased, the number of connections to MySQL increased until it finally hit the limit.
After realizing the error, we configured connection pooling to the database from Go itself. We set the maximum open connections and the maximum idle connections. Read more here: http://go-database-sql.org/connection-pool.html
Kurio Connection Pooling
dbConn.SetConnMaxLifetime(1 * time.Minute)
dbConn.SetMaxIdleConns(10)
dbConn.SetMaxOpenConns(250)
When setting this, the thing we must keep in mind is not to set the maximum open connections higher than what MySQL can handle. In our case, our database could handle up to 300 connections fine.
After setting up the connection pool, we had handled the Too Many Connections error. But then we noticed something weird: the message queue on the PubSub side was already empty, but our worker was still busy processing data, and it took a long time. After finding the cause, we learned that the Go library we use has a default Maximum Outstanding Messages of 1000.
This means our worker accepts up to 1000 unacknowledged messages at a time, which explains why the subscription was already empty while our worker kept running for a long time.
Looking at the queue inside the worker, we figured that the outstanding message limit was much larger than the number of available database connections, which made the number of goroutines grow and grow. Because we only tested with around 60K messages, the goroutines could still handle it, since goroutines are small and cheap in resources. But what happens if we increase the number of users significantly?
Before tuning the Max Outstanding Messages in the PubSub client
To handle this, we set the maximum number of outstanding messages that our push worker will hold. By doing so, the worker's queue doesn't grow too large, and neither does its memory footprint.
Reducing the Max Outstanding Messages
Because the messages were coming in faster than we could query the DB for each user's active devices, we reduced the maximum outstanding messages to 100, 10% of the default.
This helped us reduce the number of goroutines running in our worker.
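In the Go PubSub client, this is a one-line change on the subscription's receive settings; the subscription name below is a placeholder.

// The Go PubSub client defaults MaxOutstandingMessages to 1000;
// capping it means the worker holds at most 100 unacked messages at a time.
sub := client.Subscription("targeted-push-sub")
sub.ReceiveSettings.MaxOutstandingMessages = 100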
After making those improvements, we tried again. It ran well with the optimizations, but then another problem appeared, a new and unexpected one.
After running with the improvements, we found another issue. Our system is a push management service, so besides sending push notifications to GCM/APNS, it also exposes an API that serves a list of past notifications.
This API is used by both our dashboard and mobile app. The issue was that while our worker was sending the targeted push, we couldn't access that API. It would just keep waiting until it finally timed out.
At first we were confused about the cause. After some investigation, we finally realized it only happens while the worker is sending the targeted push notification. It turned out the workers had used up the entire MySQL connection pool and left the API waiting for a connection from the pool.
To handle that issue, with the delivery target (a.k.a. deadline) in mind, we decided to separate the MySQL clients for the API and the worker, so that each gets its own connection pool.
So now we have two MySQL clients in a single project.
Creating two MySQL client connections
Instead of using a single client for the entire system, we split it into two clients to the database.
Client 1 is used by the API side to serve the dashboard and mobile apps, and Client 2 is used by the push worker.
We then set the maximum connections on the API side lower than on the worker side. Since the worker needs a connection for each user it processes and we need to send the push quickly, it needs a lot more connections in the pool.
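A minimal sketch of that split is below. The API pool sizes are illustrative assumptions; only the worker's limit of 250 comes from the settings shown earlier.

// Two independent clients to the same database, each with its own pool.
apiDB, err := sql.Open("mysql", dsn)
if err != nil {
	log.Fatal(err)
}
// The API only serves the dashboard and mobile apps, so its pool stays small.
apiDB.SetMaxOpenConns(50)
apiDB.SetMaxIdleConns(5)
apiDB.SetConnMaxLifetime(1 * time.Minute)

workerDB, err := sql.Open("mysql", dsn)
if err != nil {
	log.Fatal(err)
}
// The push worker needs a connection per user being processed, so it gets
// most of the budget, still below MySQL's ~300 connection limit.
workerDB.SetMaxOpenConns(250)
workerDB.SetMaxIdleConns(10)
workerDB.SetConnMaxLifetime(1 * time.Minute)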
With this setup, the dashboard and mobile app can access the API even when we’re sending the targeted push notification.
Then, for the fourth time, we tried again, running the targeted push notification.
Now we could call the push management API, and the dashboard ran well even while we were sending a targeted push notification.
But….
We noticed some weird behavior. It wasn't about the logic; everything was running correctly now. But a single notification run could take up to 3 hours to send its messages to cloud messaging. That's really, really long. After looking at the logs, we could see that the query for getting each user's device info was taking a long time.
WTF!!!!
Then we started to investigate the query. It turned out that a single query took at least 30 seconds. Really?
The problem wasn’t because we don’t have an index. We already have an index in our table. But it’s like the Mysql not use the proper index, but also it’s because of our slow query.
SELECT `token`
FROM device
WHERE `id` IN (
    SELECT MAX(`id`) AS `id`
    FROM device
    WHERE `deleted_at` IS NULL
    AND `user_id` IN ("userIDs")
    GROUP BY uuid
)
To improve it, we wrote a new query. Because we only need the single latest device per user, we changed it to:
SELECT `token`
FROM device USE INDEX (user_id)
WHERE `user_id` = ?
AND `deleted_at` IS NULL
ORDER BY id DESC
LIMIT 1
With this query, we forced MySQL to use the proper index, and it saved us a lot of query time. With the new query, it now only takes about 30 milliseconds.
Hooorayyy !!!!
Even after getting the query speed we wanted, we made some additional improvements.
Kurio Prepared Statement Querying
Instead of redoing the whole procedure every time we query (get a connection, build the query, execute it), we simplified it by using a prepared statement. We prepare the statement when the application starts, then inject it into the function that the push worker calls.
In the main function, we open the connection and prepare the statement:
dsn := "kurio:someofpassword@tcp(127.0.0.1:3306)/push_notification?parseTime=1&loc=Asia%2FJakarta"
dbConn, err := sql.Open(`mysql`, dsn)
if err != nil {
	log.Fatal(err)
}

// StmtGetDevicesByUserID will be injected into the GetDeviceByUserID function.
StmtGetDevicesByUserID, err := dbConn.Prepare(queryGetDeviceByUserID)
Once prepared, the statement StmtGetDevicesByUserID is injected into the function that retrieves the device. In that function, we just execute the prepared statement:
func (repo *Repository) GetDeviceByUserID(userID int64) *Device {
	row := repo.StmtGetDevicesByUserID.QueryRow(userID)

	device := &Device{}
	if err := row.Scan(&device.Token); err != nil {
		return nil
	}
	return device
}
So when each worker gets a new message from Google PubSub, it now only passes the user ID, because the query and its boilerplate have already been prepared.
For the HTTP call to the article service, we use a caching mechanism, because many users may get the same article. To reduce HTTP calls to the article service, we added an in-memory cache whose entries expire after a certain time (10 minutes, for instance).
We used an open-source in-memory Go cache: https://github.com/patrickmn/go-cache
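A rough sketch of how such a cache fits in is below, assuming a hypothetical getArticleFromService helper that performs the actual HTTP call and an Article struct for its result.

import (
	"strconv"
	"time"

	cache "github.com/patrickmn/go-cache"
)

// Entries expire after 10 minutes; expired items are purged every 15 minutes.
var articleCache = cache.New(10*time.Minute, 15*time.Minute)

func getArticle(articleID int64) (*Article, error) {
	key := strconv.FormatInt(articleID, 10)

	// Serve from the in-memory cache when possible.
	if cached, found := articleCache.Get(key); found {
		return cached.(*Article), nil
	}

	// Otherwise hit the article service over HTTP and cache the result.
	article, err := getArticleFromService(articleID)
	if err != nil {
		return nil, err
	}
	articleCache.Set(key, article, cache.DefaultExpiration)
	return article, nil
}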
We also made some configuration changes to the default HTTP transport:
timeout := 2 * time.Second
keepAliveTimeout := 100 * time.Second

defaultTransport := &http.Transport{
	Dial: (&net.Dialer{
		KeepAlive: keepAliveTimeout,
	}).Dial,
	MaxIdleConns:        100,
	MaxIdleConnsPerHost: 600,
}

client := &http.Client{
	Transport: defaultTransport,
	Timeout:   timeout,
}
This is used to handle the HTTP calls and their timeouts gracefully. For more explanation, you can read this good blog post by Cloudflare: https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/
After a few long weeks developing this feature, there are many things I learned and many things that helped me improve my code.
Thanks to Andrew Ongko, Rifad Ainun Nazieb, Arie Ardaya Lizuardi, and all the teams involved at Kurio.
If you think this is worth reading, share it on Twitter, Facebook, etc., so other people can read it too. If you have a question or feedback, you can leave a response below or just email me.