1st-hand & in-depth info about Alibaba's tech innovation in AI, Big Data, & Computer Engineering
Picking the right open source project can feel like a game of minesweeper — one wrong move, and boom! Your system’s dead.
There’s a popular principle in software development: DRY, short for “don’t repeat yourself.” The main purpose of open source is sharing, so that nobody has to reinvent wheels when they could be building engines. In an industry as fast-paced as the Internet, speed is everything. Open source saves manpower and time while accelerating business development. What’s not to love?
However, the ground reality isn’t always so pretty. Open source-based projects bring their own set of problems, such as downtime or data loss. In some situations, the sheer variety of open source projects can create a real headache. Should one go with MySQL, or PostgreSQL? Memcached or Redis? Angular or React?
Here are some thoughts, observations, and personal lessons based on my experience of working with the UC browser team and developing open source projects.
Since many open source options tend to be similar (and yet so different at the same time!), we were perpetually faced with dilemmas when making picks. To simplify the process and keep each choice suited to its end purpose, we learned to pay more attention to how well a candidate met our platform’s needs, rather than which one was simply “the best.”
While working on a social platform, we discovered an open source solution called TT (Tokyo Tyrant), which promised to replace both Memcached for caching and MySQL for persistent storage. Thinking it was an awesome solution, we used it extensively, only to realize later that it could not completely replace MySQL, and that its seemingly fantastic features were buggy and took a long time to master.
If the business requires 1,000 TPS, then there’s effectively no difference between a solution offering 20,000 TPS and one offering 50,000. The architecture can always be reworked later for higher throughput; as the famous programming adage puts it, premature optimization is the root of all evil.
New open source projects often claim to be much better than their predecessors in performance, functionality, and features. Tempting as they may be, all of them carry bugs. No amount of programming talent guarantees a perfect first version. The developers behind Windows, Linux, and MySQL are among the best in the industry, yet they still issue updates and bug fixes. Applying a young open source project to a production environment introduces new risks, ranging from system downtime (already fairly serious) to irreversible data loss (a complete nightmare).
While we were using TT, a power failure corrupted its data files in a way that restarting the system could not fix. Thankfully, our daily backup practice saved the day, though we still lost the data from the day of the failure. We then spent a long time combing through the source code and writing our own tools to restore what we could. Thankfully, the data wasn’t financial in nature; otherwise we would’ve been in big trouble.
So, how does one check whether the open source project is mature? Look for the following:
1. Version number
Unless absolutely necessary, avoid projects at version 0.x and go with one that has at least a 1.x release. A higher version number is generally a safer bet.
2. The number of companies using the project
Open source projects usually list adopting companies right on their home pages. Check whether any big, established companies use the project, and how many users there are overall.
3. Community engagement
Check whether the community around the project is active: the number of posts and replies, and how quickly questions get answered.
When selecting an open source project, we tend to focus on technical indicators such as performance and reliability rather than on how easy it is to operate and maintain. However, operation and maintenance is an indispensable part of running any solution in a production environment. Without it, any incident that requires troubleshooting can spiral into desperate firefighting.
You can examine a project’s operation and maintenance capabilities by studying the following aspects:
1. Whether the open source solution’s log is complete: Some open source solutions have very brief log files with only a few lines for starts and stops. This makes it difficult to investigate and solve issues when a problem pops up.
2. Whether the open source solution has maintenance tools: Command-line tools and management consoles help in viewing the system’s operating state.
3. Whether the open source solution has real-time detection abilities: This includes fault detection and recovery, alarms, switching, etc.
Many people simply copy open source code and deploy it to online applications after watching a few demos. This is a surefire way to endanger the main system’s health and reliability, akin to getting behind the wheel of a car after merely flipping through the driver’s manual.
A couple of anecdotes come to mind. Once, one of our teams decided on a whim to use Elasticsearch without fully understanding its inverted index, and went live with all configuration values at their defaults. Node ping intervals turned out to be too long and the eviction of abnormal nodes too slow. The result? The entire site broke down.
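For illustration, fault-detection timing in Elasticsearch’s older zen discovery was tuned through elasticsearch.yml. The setting names below are real, but the values are purely illustrative, not those from the incident:

```yaml
# Zen-discovery fault detection (Elasticsearch 1.x-6.x). The defaults
# are conservative; left untouched, a dead node can take a long time
# to be ejected from the cluster. Values here are illustrative only.
discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other
discovery.zen.fd.ping_timeout: 10s   # how long to wait for a ping reply
discovery.zen.fd.ping_retries: 3     # failed pings before a node is dropped
```

The point is not these particular numbers, but that every such knob should be understood and deliberately set before going live.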
Another time, several teams began using MySQL without studying it properly. Business departments eventually complained that MySQL was running too slowly. The culprit? The most critical parameters (such as innodb_buffer_pool_size, sync_binlog, and innodb_log_file_size) were either misconfigured or not configured at all, resulting in poor performance.
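As a sketch of what deliberate configuration looks like, a my.cnf fragment covering those three parameters might read as follows; the values are illustrative and depend entirely on hardware and workload:

```ini
[mysqld]
# Often sized to 50-70% of RAM on a dedicated database server
innodb_buffer_pool_size = 8G
# 1 = flush the binlog to disk on every commit: safest, but slower
sync_binlog = 1
# Larger redo logs smooth out write-heavy workloads
innodb_log_file_size = 1G
```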
When studying and testing a project:
1. Aim to fully understand the principles of the project by reading through its design documents and available white papers
2. Check the role and impact of each configuration, and identify the key configuration items
3. Conduct performance tests for multiple scenarios
4. Conduct stress tests where you run the project for several days continuously. Observe fluctuations in CPU, memory, disk IO, and other indicators.
5. Conduct fault tests: Kills, power-downs, restarts, switchovers, etc.
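The stress-test step above can be sketched as a minimal harness. `soak_test` and its workload are hypothetical names of my own; a real run would last days against the actual system and sample CPU, memory, and disk IO alongside latency:

```python
import statistics
import time

def soak_test(operation, duration_s):
    """Call `operation` in a loop for `duration_s` seconds,
    recording per-call latency so spikes and drift show up."""
    latencies = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        operation()
        latencies.append(time.monotonic() - start)
    latencies.sort()
    return {
        "calls": len(latencies),
        "avg_ms": statistics.mean(latencies) * 1000,
        "p99_ms": latencies[int(len(latencies) * 0.99)] * 1000,
        "max_ms": latencies[-1] * 1000,
    }

# Hypothetical stand-in for a real client call (e.g. a cache GET)
report = soak_test(lambda: sum(range(500)), duration_s=1.0)
print(report["calls"], round(report["p99_ms"], 3))
```

Watching how p99 and max latency diverge from the average over a long run is often what exposes the problems a quick demo hides.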
Can we simply move ahead once the study and testing stage is past and all issues have been addressed? Not so fast: the coast isn’t clear yet. Even with deep research and multiple tests, caution remains essential, since these measures only reduce risks; they do not eliminate them.
When we used TT, we actually arranged for an expert to review the code and run tests before we deployed it. Even though that took a whole month, we still hit many problems after going live. The online production environment is a complex beast, best approached with caution. In our experience, what works best is to use the project for secondary businesses first, then roll it out to more important sections phase by phase.
Even when all seems well, it’s not yet time to sit back and relax, especially right after adopting a new open source project. With bad luck, one may encounter a bug that has no fix, or worse, one that nobody else in the world has hit yet. Such a situation can deal a fatal blow to a business, especially if it involves storage failure.
A business some of our colleagues knew had a harrowing experience with MongoDB. When a machine went down, they lost part of their critical data and could not restore it; worse still, they had no backup. The story spread fast in our own offices, and ever since, the O&M team has opposed every proposal to use MongoDB, even for trials.
Completely rejecting a project over a single incident may be a bit extreme, but the story still serves as a wake-up call: when using open source projects for important businesses or data, it’s wise to keep a more mature solution as a backup rather than wringing one’s hands when something goes wrong. For example, a team that primarily uses Redis can keep MySQL as a backup store. This adds cost and complexity, but it is necessary insurance against failure.
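The Redis-with-MySQL-backup idea can be sketched as a cache-aside pattern with a durable fallback. This is a minimal illustration of the pattern, not anyone’s production code; plain dicts stand in for the Redis and MySQL clients:

```python
class BackedCache:
    """Cache-aside with a durable fallback: the cache (e.g. Redis)
    serves reads; the durable store (e.g. MySQL) is the source of
    truth and survives a total cache wipe."""

    def __init__(self, cache, store):
        self.cache = cache   # fast, may lose data
        self.store = store   # slow, durable

    def put(self, key, value):
        self.store[key] = value      # durable write first
        self.cache[key] = value      # then populate the cache

    def get(self, key):
        try:
            return self.cache[key]
        except KeyError:
            value = self.store[key]  # fall back to durable storage
            self.cache[key] = value  # repopulate for next time
            return value

db = BackedCache(cache={}, store={})
db.put("user:1", "alice")
db.cache.clear()            # simulate losing the whole cache
print(db.get("user:1"))     # prints: alice
```

The write path pays double, but a cache failure degrades into slower reads instead of lost data.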
It is natural to feel compelled to make changes when we find that some aspect of an open source project does not meet our needs. How one goes about making such changes is the major challenge.
One way is to allocate a few people to thoroughly transform the project to suit specialized business needs, but this isn’t ideal for two reasons. First, the investment runs too high: completely modifying an open source solution of Redis’s magnitude would take at least two people working for four weeks. Second, making too many modifications makes it practically impossible to track and incorporate newer versions of the upstream project.
This makes developing auxiliary systems, such as monitoring and load balancing, a far more feasible path. For example, to add a clustering function to Redis, there was no need to change Redis itself; adding a proxy layer achieves the same end (Twitter’s Twemproxy follows the same principle). And once Redis began providing clustering in v3.0, all we had to do was switch from our own solution to Redis 3.0.
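A proxy layer of this kind can be sketched in a few lines: the proxy hashes each key to one of several unmodified backends, so clustering lives entirely outside the project’s code. This is my own toy illustration of the idea, with dicts standing in for real Redis connections:

```python
import hashlib

class ShardingProxy:
    """Minimal proxy-layer sharding in the spirit of Twemproxy:
    clients talk to the proxy, which hashes each key to one of
    several independent, unmodified backends."""

    def __init__(self, backends):
        self.backends = backends

    def _pick(self, key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.backends[int(digest, 16) % len(self.backends)]

    def set(self, key, value):
        self._pick(key)[key] = value

    def get(self, key):
        return self._pick(key).get(key)

proxy = ShardingProxy([{}, {}, {}])
proxy.set("session:42", "data")
print(proxy.get("session:42"))  # prints: data
```

Because the routing logic is isolated in the proxy, swapping the backends for Redis 3.0’s native clustering later means changing only the client side, not the data layer.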
But what if you really need changes to the open source project itself? You can propose a feature or report a bug to the developers, but there’s no guarantee of a quick response. The right approach depends on the urgency of the matter at hand; for less urgent situations, it’s always better to arrange backups and emergency measures in the meantime.
Whether or not to use an open source project at all comes down to the same trade-off between costs and benefits. There are times when an open source project may not be the best way to go.
The most significant difference between software and hardware is that software has no truly conclusive industry standards; everyone builds software the way they like. In hardware, a wheel with unique dimensions that fits no vehicle is a waste of resources, no matter how good the technology or product quality. In software, almost anything created can still find a use, and combinations are not rigidly defined. Moreover, to serve a broad audience, an open source project usually adopts general-purpose solutions.
However, because business types differ, general-purpose solutions aren’t suitable across the board. For example, Memcached provides clustering through consistent hashing, but for some of our businesses, the crash of a single cache node slowed down the entire business, which called for a cache backup function Memcached did not offer. Likewise, back when Redis had no clustering, we assigned a team of two to four people to work for two months and produce a caching framework supporting storage, backup, and clustering. On top of this framework we added cross-datacenter synchronization, drastically improving the availability of our platforms.
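For readers unfamiliar with the consistent hashing mentioned above, here is a toy ring of the kind Memcached clients use; the class and node names are mine, for illustration only:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: each node occupies many points on a
    ring, and a key maps to the first node clockwise from its
    hash. Losing a node only remaps the keys that lived on it,
    instead of reshuffling the whole cache."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted (hash, node) points
        for node in nodes:
            self.add(node)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key):
        i = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:1001"))
```

Note what this does not give you: when a node dies, its share of keys simply misses the cache. A backup function of the kind described above has to be built on top, which is exactly why a general-purpose solution fell short for us.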
Bending open source solutions to implement highly specific requirements can be time-consuming or entirely futile. Therefore, if cash flow, talent, and time permit, investing in your own solution that perfectly suits your platform or services isn’t a terrible alternative. After all, every tech company worth its salt does so, and without them we would never have had so many good open source projects in the first place.
(Original article by Li Yunhua 李运华)