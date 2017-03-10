
Imagine you have to rewrite an existing web service to move to a new payment gateway (PSP, or payment service provider) for business reasons. Your first thought might be to replace the old gateway with the new one in its entirety and roll that out. That is a naive approach, especially with payment gateways, which have their own SLAs, agreements with acquiring banks, risk and fraud detection software, and so on. All of this makes the switch risky in terms of conversions, revenue, customer retention and, eventually, the business. In this blog post, we discuss the approach we took to mitigate risks while switching payment gateways and why it was so important.
Our old payment service is written in Python 2 and is tightly coupled with the old payment gateway. When we first dissected the problem, we thought that integrating a brand new payment gateway within the same flows, URLs and Python Bottle views would be trivial. Once we started working on the first POC, we realized that we were already generating a lot of spaghetti code, as the API flows for the two payment gateways were entirely different. While the old payment gateway's API design forced us to rely heavily on Redis and Gevent to optimize user and payment flows, there was absolutely no reason to reintroduce that dependency with the APIs of the new one.
A/B Tests Are Good As Long As They Are Simple
A/B tests are a powerful tool for making product decisions. At dubizzle OLX, these tests have traditionally been oriented towards user-facing components such as page flows, page components, placements and products.
However, these tests complicate things when you want to test underlying systems that are highly dependent on each other.
While we were still working on our POC, we realized that the A/B test we wanted to perform should not be done using tools such as Optimizely even if we somehow managed to integrate the new gateway within the same views and user flows. Here is why:
if cookie == 'OLD':
    use old payment gateway API
else:
    use new payment gateway API
Note that once a user lands in bucket A (the web service talking to the old payment gateway), that web service needs to ensure that the payment flow for that user continues on the same payment gateway on which it started. For instance, the web service should never initiate a transaction on the old payment gateway and try to finalize it on the new one. This problem can be resolved using sticky sessions, which are supported by both Optimizely and HAProxy.
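The stickiness requirement can be sketched in a few lines of Python. This is only an illustration of the idea, not our service code: the cookie name `pg_bucket` and the function `pick_gateway` are hypothetical, and the random choice stands in for whatever the load balancer decides on the first request.

```python
import random

GATEWAYS = ("OLD", "NEW")
COOKIE_NAME = "pg_bucket"  # hypothetical cookie name, for illustration only


def pick_gateway(cookies, rng=random):
    """Return the gateway bucket for a request.

    A valid bucket cookie is always honoured, so a flow that started on
    one gateway stays on it. Only requests without a valid cookie are
    assigned a bucket, and the choice is written back to the cookie jar.
    """
    bucket = cookies.get(COOKIE_NAME)
    if bucket not in GATEWAYS:
        bucket = rng.choice(GATEWAYS)
        cookies[COOKIE_NAME] = bucket
    return bucket


# The same cookie jar is used to initiate and to finalize a transaction,
# so both steps resolve to the same gateway.
session_cookies = {}
initiated_on = pick_gateway(session_cookies)
finalized_on = pick_gateway(session_cookies)
```

Whatever the first call decides, every later call with the same cookies returns the same bucket, which is the property the payment flow depends on.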
Once we had some idea around the problem we were trying to deal with, we decided to write a new payment service integrated with the new gateway and compare the performance through a 50–50 split A/B test. The whole A/B test had to fulfill at least the following:
Since Optimizely was out of the equation, we decided to leverage HAProxy for running the test.
HAProxy is a powerful layer 4 and layer 7 load balancer with an extensive set of features. One such feature is cookie-based persistence in a backend, which happens at layer 7.
We configured our HAProxy to have three backends: main, old_service and new_service.
To give you a taste, here is an example of our HAProxy backends:
You might be wondering why we had static backends. This will become clear once we come to the HAProxy frontend, so let's discuss the main backend first.
We went for a weighted round robin approach to load balance requests between old-service-old-pg and new-service-new-pg. This makes sense for an A/B test, where you just have to split the traffic into buckets A and B, with the caveat that any request that lands in bucket A should never land in bucket B for the duration of that session. We neatly achieved this through HAProxy's powerful cookie directive. Assuming that any user who initiates a transaction completes it within 2 hours, we told HAProxy to discard a session cookie after 2 hours and generate a new one based on what round robin decides for that request.
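To make the behaviour concrete, here is a small Python model of weighted round robin combined with a persistence cookie that expires after a maximum lifetime. It is a simplified sketch and not how HAProxy is implemented: the class `StickyBalancer`, the cookie represented as a `(server, first_seen)` tuple, and the naive weight expansion are all assumptions made for illustration.

```python
import itertools

MAXLIFE_SECONDS = 2 * 60 * 60  # two hours, matching the assumption in the post


class StickyBalancer:
    """Toy model: weighted round robin plus a cookie with a maximum lifetime."""

    def __init__(self, weights, maxlife=MAXLIFE_SECONDS):
        # Naive weighted schedule: each server appears `weight` times per cycle.
        schedule = [name for name, weight in weights.items() for _ in range(weight)]
        self._cycle = itertools.cycle(schedule)
        self._maxlife = maxlife

    def route(self, cookie, now):
        """Return (server, cookie) for a request arriving at time `now` (seconds).

        `cookie` is None or a (server, first_seen) tuple from an earlier response.
        """
        if cookie is not None:
            server, first_seen = cookie
            if now - first_seen < self._maxlife:
                # Cookie still valid: stay on the same server, cookie unchanged.
                return server, cookie
        # No cookie, or cookie older than maxlife: let round robin decide again.
        server = next(self._cycle)
        return server, (server, now)


balancer = StickyBalancer({"old-service-old-pg": 1, "new-service-new-pg": 1})
server, cookie = balancer.route(None, now=0)             # first visit: bucket assigned
same_server, _ = balancer.route(cookie, now=60 * 60)     # one hour later: same bucket
```

With equal weights, cookieless requests alternate between the two servers, while any request carrying a cookie younger than the maximum lifetime stays where it was first sent.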
This solved two very strong use cases for us:
There is a small vulnerability in our setup that you probably wouldn’t have noticed yet. Consider the following diagram:
A user starts a transaction on the old payment service at the 45-minute mark after receiving the cookie. The user then proceeds to the bank's 3-D Secure page at 1 hour 59 minutes and is redirected back to our success URL after the 2-hour mark. Since HAProxy is configured with a cookie maxlife of 2 hours, it will discard the session cookie and try to insert a new one on the redirect to the success URL. If we are unlucky, round robin might tie the new session to the new payment service, which would not know how to handle a success redirect that was configured by the old service.
In our case, we chose to ignore this problem because we had seen that users who initiate a transaction usually complete it in under two hours. But can you imagine the severity of this problem if we had set the cookie maxlife to 1 minute, for example?
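The timeline above boils down to simple arithmetic, sketched here in Python. The helper `cookie_still_valid` is a hypothetical, simplified model of the expiry check (the exact boundary behaviour in HAProxy may differ); the timestamps are the ones from the scenario, measured from the moment the cookie was issued.

```python
from datetime import timedelta


def cookie_still_valid(issued_at, now, maxlife):
    """Simplified model: the cookie is honoured while it is younger than maxlife."""
    return now - issued_at < maxlife


issued_at = timedelta(0)                            # cookie received
transaction_start = timedelta(minutes=45)           # payment initiated on the old service
three_d_secure = timedelta(hours=1, minutes=59)     # user reaches the bank's page
redirect_back = timedelta(hours=2, minutes=1)       # success redirect, after the 2-hour mark

two_hours = timedelta(hours=2)
one_minute = timedelta(minutes=1)

# With a 2-hour maxlife only the final redirect falls outside the window.
at_risk_with_two_hours = not cookie_still_valid(issued_at, redirect_back, two_hours)

# With a 1-minute maxlife the cookie is already gone when the transaction starts.
at_risk_with_one_minute = not cookie_still_valid(issued_at, transaction_start, one_minute)
```

The shorter the maximum lifetime, the more steps of a single payment flow can fall outside the window and be re-bucketed.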
Let's come back to why we had the old_service and new_service backends along with main. HAProxy is usually configured with a frontend proxy that handles all ACLs (Access Control Lists), which in our case was configured like this:
These webhooks and endpoints that you see in the configuration are specific views that handle requests originating from the external gateways. Since the old and new services cannot speak to each other's payment gateway, putting them under a round robin load balancer would mean 400s or 404s for as much as 100% of the requests. Also, since these requests come from the payment gateways, there is no need for any sort of persistence: they do not involve flows that span different views (every request is touch and go with a 200).
The best place to fix this problem was the load balancer, and we did this by defining ACLs that route requests to the appropriate application servers through the old_service and new_service static backends. Put simply, this is the kind of conversation that takes place between a request originating from a payment gateway and HAProxy before the request reaches the appropriate application servers:
Payment Gateway: Hey, I am an internal request from old payment gateway and I need to tell you that I have successfully received the payment.
HAProxy: Hey, I recognize you! You should pass through door old_service to reach your destination application.
Destination Application: Hey, I know how to process you! Let's activate the order for this user. I will give you the number 200 in return!
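The routing decision in this conversation can be modelled as a prefix lookup. The Python below is only a sketch of the idea and not HAProxy configuration: the path prefixes `/webhook-old` and `/webhook-new` and the function `choose_backend` are assumptions for illustration, while the backend names are the three described above.

```python
# Hypothetical path prefixes; the real ACLs match the gateway-specific views.
ACL_RULES = [
    ("/webhook-old", "old_service"),
    ("/webhook-new", "new_service"),
]
DEFAULT_BACKEND = "main"  # user traffic: round robin with cookie persistence


def choose_backend(path):
    """Return the backend for a request path.

    Gateway-originated requests match an ACL and go straight to the static
    backend of the service that understands them. Everything else falls
    through to the A/B-tested main backend.
    """
    for prefix, backend in ACL_RULES:
        if path.startswith(prefix):
            return backend
    return DEFAULT_BACKEND


backend_for_old_webhook = choose_backend("/webhook-old/payment")  # "old_service"
backend_for_user = choose_backend("/checkout")                    # "main"
```

Because the ACL-matched requests never reach the main backend, they are neither load balanced across the two services nor given a persistence cookie.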
With all conditions in place, how would you test such a complicated A/B test?
HAProxy provides very precise logs for debugging such complicated systems. Let's go through some examples from our actual A/B test logs to understand this.
Notice the flags NI, VU and VN for each request made for the same order id. These flags give a lot of information on how persistence was handled by the client, the server and HAProxy, and they are among the most important indicators to look at while testing and debugging.
Quoting HAProxy docs here:
-- : Persistence cookie is not enabled. This is the case where the request path is webhook-new and it doesn't make sense to put it under the A/B test.
NI : No cookie was provided by the client, one was inserted in the response. This typically happens for first requests from every user in "insert" mode, which makes it an easy way to count real users. This is where round robin decides the test bucket for the user.
VU : A cookie was provided by the client, with a last visit date which is not completely up-to-date, so an updated cookie was provided in response. This can also happen if there was no date at all, or if there was a date but the "maxidle" parameter was not set, so that the cookie can be switched to unlimited time.
VN : A cookie was provided by the client, none was inserted in the response. This happens for most responses for which the client has already got a cookie. This is how HAProxy finds the current bucket for the user and directs him to the correct backend.
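When going through many log lines, it can help to tally these flags. The Python sketch below assumes you have already extracted the termination state field from each HTTP log line and that it is a four-character string whose last two characters are the persistence flags described above (for example `--NI`). The helper names and the sample values are made up for illustration and are not taken from our logs.

```python
from collections import Counter

# Short descriptions of the flag pairs discussed in this post.
PERSISTENCE_FLAGS = {
    "--": "persistence cookie not enabled for this request",
    "NI": "no cookie from the client, one inserted in the response",
    "VU": "valid cookie from the client, updated cookie sent in the response",
    "VN": "valid cookie from the client, none inserted in the response",
}


def cookie_flags(termination_state):
    """Return the two persistence characters of a 4-character termination state."""
    if len(termination_state) != 4:
        raise ValueError("expected a 4-character termination state")
    return termination_state[2:]


def summarize(termination_states):
    """Count how often each persistence flag pair occurs."""
    return Counter(cookie_flags(state) for state in termination_states)


# Made-up sample: two first-time users, two follow-up requests, one webhook.
sample_states = ["--NI", "--VN", "--VN", "----", "--NI"]
summary = summarize(sample_states)
new_users = summary["NI"]  # requests where a bucket was assigned
```

Counting `NI` entries gives a rough number of users who were assigned a bucket, while `--` entries correspond to requests that bypassed the A/B test entirely.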
Notice that once HAProxy picks the server new-service-new-pg and sets a cookie, all subsequent requests from that user are directed to new-service-new-pg through the main backend.
For everything else that matches one of the ACLs, HAProxy directs the request without setting any cookie. This also ensures that the A/B test results are not skewed by HTTP calls made by machines rather than humans.
Note that DNS and Edge Tier are common across all our micro services. All the A/B test magic happens after the traffic passes through HAProxy load balancers.
While this has worked perfectly for us, there are more complicated load balancing use cases that could require many more parameters than just session persistence. Shopify has covered this in an excellent blog post on leveraging Nginx and OpenResty.
This A/B test was a major team effort across the dubizzle infrastructure team and product engineering. Thanks to everyone involved!
Hope you enjoyed this post. Feel free to add comments or ask questions. You can also reach out to me on Twitter: mrafayaleem
PS: This post was also published on dubizzle engineering blog.