The 5th anniversary of the Healthcare.gov launch failure offers an opportunity to reflect upon computer system defects, human error, organizational flaws, and the best principles and practices for solution delivery in the information technology industry. In this blog and my upcoming book, **Bugs: A Short History of Computer System Failure**, I will chronicle some important system failures of the past and discuss ideas for improving the future of system quality. As IT becomes increasingly woven into Life, the quality of hardware and software impacts our commerce, health, infrastructure, military, politics, science, security, and transportation. The Big Idea is that we have no choice but to get better at delivering technology solutions because our lives depend on it.
On 1 October 2013, the healthcare.gov website was launched under the provision of the Patient Protection and Affordable Care Act. Most website users experienced crashes, delays, errors, and slow performance throughout the days and weeks following its much heralded release. The site received over four million unique visitors on the first day, and only six people successfully enrolled in health plans that day. By some estimates, only 1% of interested users were able to enroll during the first week of operation during which the site had over eight million visitors. Even when users managed to register and shop online, they were later confronted by frustrating errors or confusing duplicates in the enrollment applications submitted to insurers. The site was taken down during the first weekend for major repairs because it was practically unusable. There were also multiple security defects found including insecure transmission of personal data, unvalidated password resets, error stack traces revealing internal components, and violations of user data privacy. The US Government Accountability Office (GAO) estimated that the US federal government and American taxpayer spent around $840 million USD developing the website. The healthcare.gov site was plagued by many mistakes common on large IT projects and systems; this essay will explore the major business, technology, and human factors that contributed to the launch failure of healthcare.gov.
According to the Organization for Economic Cooperation and Development (OECD) and the Kaiser Family Foundation (KFF), the US spends 17 percent of its gross domestic product (GDP) on healthcare, more than any nation in the world in terms of the total amount and relative to its economy’s size. The rate of healthcare spending is also increasing faster than the economy’s growth rate such that the Congressional Budget Office (CBO) estimates that health care related spending will consume 40 percent of US GDP by 2080. However, there is significant inequality and disparity in access to healthcare in the US. According to the KFF, the overall infant mortality rate in the US is 5.8 per 1000 live births in 2014; for blacks this rate is 10.9, while for whites the infant mortality rate is less than half that at 4.9. It is also worth noting these US infant mortality numbers are worse than the comparable OECD country average of 3.4. Again per the OECD and KFF, hospital admissions for preventable diseases such as congestive heart failure, asthma, diabetes, and hypertension are more frequent in the US than in comparable countries in the OECD. The US life expectancy of 79 in 2015 is less than the average of 82 for comparable countries in the OECD, and that trend has been getting worse since 1980 according to the KFF. Furthermore, 50 million Americans, almost one in five of the non-elderly, had no health insurance coverage in 2010. Many of these problems can be explained by the fact that the US is the only major industrialized nation without universal access to basic healthcare. With individuals, corporations, and the government all spending more on healthcare and the system delivering lower quality outcomes than other comparable countries, the US environment was ready for a change.
On 23 March 2010, President Barack Obama signed into law the Patient Protection and Affordable Care Act (ACA), the most comprehensive reform of the US medical system in fifty years. The ACA transformed the non-group insurance market in the US, mandated that legal residents have health insurance or pay a tax penalty, raised revenue through a variety of new taxes, subsidized private insurance coverage based on income eligibility, reorganized spending under the country’s largest public health insurance plans for the elderly and poor (Medicare and Medicaid), and compelled insurers to make basic insurance plans available to all legal residents regardless of pre-existing conditions. The law required the establishment of health insurance marketplaces by January 1, 2014. The marketplaces permit individuals to compare and select insurance plans offered by private insurers. For states that elected not to establish a marketplace, the Centers for Medicare and Medicaid Services (CMS) was responsible for providing a federally facilitated marketplace (FFM). CMS signed contracts with key vendors in September 2011 and began development of the healthcare.gov system in December 2011.
Healthcare.gov Business Workflow
“Now, this is real simple. It’s a website where you can compare and purchase affordable health insurance plans, side by side, the same way you shop for a plane ticket on kayak… the same way you shop for a TV on Amazon.” — President Obama @ Prince George Community College in Maryland Late September 2013
Healthcare.gov Project Organization Chart
Although the healthcare.gov project had only just begun, the seeds of future difficulties had already been planted in several areas. First, CMS did not have the People in terms of organizational experience for developing and managing large IT systems. Other US government agencies such as the Department of Defense (DoD) and the National Aeronautics and Space Administration (NASA) had decades of institutional experience in developing, delivering, and operating reliable IT systems. Serving as the lead project manager and integrator for a complex, first-of-its-kind system was beyond the capability of CMS, and a better decision would have been to assign this responsibility to a primary vendor. Second, there was a project Leadership gap. Although CMS and its Deputy CIO (Henry Chao) managed the project in theory, other key members of the project steering committee, including the White House’s CTO (Todd Park), the executive Office of Health Reform (Jeanne Lambrew), and the Department of Health and Human Services (Kathleen Sebelius, Bryan Sivak), exercised more influence and power in practice; they delayed important decisions at the start and increased scope at the end, both of which contributed to the launch failure. Additionally, no one at the various agencies involved in the project had visibility into all the critical milestones that each group needed to reach in order to complete the project successfully. Remarkably, the executive Office of Management and Budget (OMB) did not take an active role in the project, although this was well within its responsibilities, especially when one considers that it had run an executive IT dashboard to monitor large federal IT projects and investments since 2009. The irony of the government demonstrating poor project governance practices should not be lost upon the reader, a theme we shall oft return to. One subtle aspect of the Leadership issue was the Cognitive Bias inside the White House about its unique ability to leverage IT.
The Obama campaign teams had pioneered the use of social media and data mining in the 2008 and 2012 presidential elections, respectively. These successes resulted in executive overconfidence and set unrealistic expectations about what could be immediately delivered. As we will see later, this cognitive bias would haunt the healthcare.gov launch when the deadline loomed.
Third, there were serious Procurement problems found by the GAO and the HHS Office of Inspector General (OIG), which conducted separate audits into the launch of the healthcare.gov website. According to the GAO and HHS audit reports, CMS awarded sixty (60) contracts to thirty-three (33) different vendors, but did not satisfy important aspects of the Federal Acquisition Regulation (FAR), which is the common procurement framework across US federal agencies. The largest contract was granted to Conseillers en Gestion et Informatique (CGI), a Canadian IT company headquartered in Montreal that employed more than 70,000 people, reported revenues of $10 billion in 2013, and was worth more than $11 billion on the TSX in 2013. CGI was a roll-up conglomerate that grew by acquiring others; it had expanded into US government contracts with two investments: the 2004 purchase of American Management Systems and the 2010 purchase of Stanley. The first procurement problem identified by the GAO and HHS reports was that CMS did not clearly assign a lead system integrator role to any of the primary vendors selected in the key contracts until after the October 2013 launch. The CMS CIO reported to the OIG that CMS perceived CGI to be the project’s lead integrator, but the company stated it did not have the same understanding of its role. Second, CMS had not prepared a written procurement strategy to describe the overall procurement process of healthcare.gov and document the factors, assumptions, risks, and tradeoffs that would guide project decisions. CMS leadership claimed that they were unaware of this FAR requirement, further confirming their lack of experience with managing large federal IT projects. Third, CMS did not conduct thorough past performance reviews of potential contractors and did not use the federal government’s own contractor database (PPIRS) when evaluating bids on four of the six key contracts.
Fourth, CMS used previously negotiated contracting vehicles for large portions of the FFM project; these vehicles avoided procurement review, were cost-plus, and put much of the financial risk on the government instead of sharing it with the vendor. In 2007, CMS established an Enterprise System Development (ESD) Indefinite Delivery Indefinite Quantity (IDIQ) contract for its ongoing IT needs, conducted a full and open competition, and awarded the ESD IDIQ contract to sixteen (16) companies. The IDIQ contract type streamlines future procurement decisions, does not require the usual oversight from the Contract Review Board (CRB) irrespective of dollar value, and per HHS procurement guidelines does not require contract acquisition plans. Fifty-three of the sixty FFM contracts, including the six key contracts, did not have individual acquisition plans, fifty-five of the contracts were awarded as orders under previous contracts, and only two of the six key contracts were reviewed by the CRB. Furthermore, CMS solicited a proposal from only one company for one-third of the sixty contracts. The CMS justification for all these issues was ignorance of regulations and the need for speed in kicking off the project without a lengthy procurement process. CGI was eventually fired and replaced by Accenture as the lead contractor in January 2014.
Healthcare.gov Architecture inferred by AppDynamics
The fourth major factor contributing to the system launch failure was the Complexity of the technology architecture. The healthcare.gov system consisted of four primary technology components: a front-end website for the FFM, a back-end data services hub, an enterprise identity management sub-system, and the hosting infrastructure. The first component was the healthcare.gov website; the site allows users to create an account, input required information, determine eligibility for financial assistance, view health care plan options, select a plan, and then pay for the plan. The website’s UI was originally developed by CGI and Development Seed using several web tools including Bootstrap, CSS, jQuery, Jekyll, Prose.io, and Ruby. Analysis by AppDynamics and Mountain River using Firebug, Greasemonkey, and YSlow revealed that site page templates contained more than ninety (90) external resource references, including around forty (40) CSS and fifty (50) JavaScript files. Simple web pages during the registration process could regularly take 8 seconds to load, and AppDynamics reported latency of 71 seconds for user account registration pages, with client-side loading and rendering taking 12 seconds and server-side responses taking 59 seconds. The website Performance was unacceptable to users and was another major factor contributing to the system launch failure. Most web content frameworks have resource optimization features to aggregate and minify CSS and JS files, but it appears that the engineering team was not aware of these features or just did not test enough, another theme we shall revisit repeatedly. Smart Bear and others also found evidence of sloppy code in the front-end: Lorem Ipsum filler text littering the website pages from machine-generated HTML5 code, as well as typos and editorial comments of the kind that formal peer review usually removes from software builds before they are released to production.
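To make the resource optimization point concrete, here is a minimal Python sketch of aggregating and minifying stylesheets. The function names `minify_css` and `bundle` are hypothetical, the forty sample stylesheets are made up for illustration, and the regex-based minifier is deliberately naive; this is not the tooling available to the healthcare.gov team, only an illustration of how forty stylesheet requests could collapse into one.

```python
import re
from typing import Sequence


def minify_css(css: str) -> str:
    """Naive CSS minifier, for illustration only.

    Drops /* ... */ comments and collapses whitespace. It does not
    understand string literals, so a production build should rely on a
    real minifier instead.
    """
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # strip comments
    css = re.sub(r"\s+", " ", css)                          # collapse whitespace
    css = re.sub(r"\s*([{};,])\s*", r"\1", css)             # tighten punctuation
    return css.strip()


def bundle(stylesheets: Sequence[str]) -> str:
    """Aggregate many stylesheets into one payload, preserving their order."""
    return "".join(minify_css(sheet) for sheet in stylesheets)


# Hypothetical page with forty small stylesheets (made-up content).
sheets = [f"/* module {i} */\n.m{i} {{\n  color: red;\n}}\n" for i in range(40)]
combined = bundle(sheets)

print(f"requests: {len(sheets)} -> 1")
print(f"bytes: {sum(len(s) for s in sheets)} -> {len(combined)}")
```

Order is preserved because later CSS rules override earlier ones, so a bundler that reorders files can change how a page renders.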
The second component was the back-end data services hub developed by CGI and QSSI using Java, JBoss, Web APIs, and the NoSQL MarkLogic database. The hub’s responsibility was to orchestrate data and services from multiple external sources such as agent brokers, insurers, CMS, DHS, Experian, IRS, state insurance exchanges, and SSA. According to Enterprise Tech magazine’s interviews with engineers at MarkLogic, the Java middleware objects were machine generated for rapid development; their consensus was “they’ll perform well for 1,000 users, just not 100,000 [concurrent] users because there’s so much overhead built-in.” The data hub confirms an applicant’s Social Security number by routing the request to the Social Security Administration (SSA). The hub verifies a user’s citizenship and immigration status by forwarding the request on to DHS. The data hub similarly confirms eligibility for financial assistance by connecting to the IRS and requesting user information on income and family size. The hub verifies a user’s residency and employment by forwarding the request on to Experian. The hub also communicates with insurers by sending enrollment requests as EDI forms F834 and handling enrollment responses. Many users reported a variety of enrollment errors ranging from garbled data to outright F834 syntax mistakes to frustrating duplication, with multiple enrollments submitted for a given user. While unit and component testing could be completed by different teams in isolation, integrating the system requires coordinated planning and testing by a skilled team of teams, because the common developer refrain of “my stuff works on my machine” compounds when complex software written by distributed teams has to work together. The third component was the Oracle Enterprise Identity Management (EIDM) system for which QSSI was also responsible.
Experts believe this identity service was a serious bottleneck because the data services hub synchronously invoked this service on every user request for authentication and authorization. The system had only been tested with an expected load of 2,000 concurrent users instead of the tens of thousands of concurrent users that actually visited the site during the first week. The project team’s war room notes for the week of system launch are sprinkled with comments about additional server hardware being set up to ensure EIDM capacity and about ongoing software issues in the EIDM itself. The fourth system component was the hardware infrastructure hosting the website, FFM, data services hub, and EIDM. The front-end website was hosted on Akamai’s Content Delivery Network (CDN). The original back-end infrastructure consisted of forty-eight (48) VMware virtual machine nodes running Red Hat Enterprise Linux (RHEL) hosted on twelve (12) physical servers located in a single Terremark data center. Oddly, some of the VMware servers ran vSphere v4.1, and others ran v5.1. Furthermore, according to Enterprise Tech’s interview with MarkLogic, the network was also misconfigured to run at 1 Gb/sec, which was below its full capacity of 4 Gb/sec. With over 8 million visitors in the first week and thousands of concurrent users, the site was a victim of its own success, and the GAO audit report confirms that CMS did not adequately plan and set up hardware capacity for the back-end. On 28 October 2013, there was also an unplanned network outage in the data center hosting the site’s back-end data services. Considering the presence of issues within and across multiple technology components, there was insufficient Architecture and Quality Assurance on this Complex system.
The system defects in the website, enrollment, identity, and infrastructure services should have been identified and fixed before launch, and they were further proof that CMS was in over its head as the technology project lead and probably putting that head in the sand when sharing status updates with the project steering committee in the late summer of 2013.
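The gap between the 2,000 concurrent users tested and the load actually seen can be approximated with a back-of-the-envelope calculation using Little's Law (average concurrency = arrival rate × average time in the system). The sketch below is illustrative only: the eight million weekly visitors and the 2,000-user test target come from the figures above, while the 20-minute session length and the 3x peak factor are assumptions chosen for the example, not measured values.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def average_concurrency(visitors: int, period_s: float, session_s: float) -> float:
    """Little's Law: L = lambda * W.

    lambda is the arrival rate (visitors per second) and W is the average
    time each visitor spends in the system (seconds).
    """
    arrival_rate = visitors / period_s
    return arrival_rate * session_s


visitors_first_week = 8_000_000   # reported first-week visitors
tested_concurrency = 2_000        # load the system was tested against
assumed_session_s = 20 * 60       # ASSUMPTION: 20-minute average session
assumed_peak_factor = 3           # ASSUMPTION: peak traffic is 3x the weekly average

avg = average_concurrency(visitors_first_week, SECONDS_PER_WEEK, assumed_session_s)
peak = avg * assumed_peak_factor

print(f"average concurrent users: {avg:,.0f}")
print(f"assumed peak concurrent users: {peak:,.0f}")
print(f"peak vs. tested load: {peak / tested_concurrency:.1f}x")
```

Even with conservative assumptions, an estimate of this kind lands well above 2,000 concurrent users, and slow pages make matters worse because they lengthen each session and therefore raise concurrency further.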
Healthsherpa.com — Small is Beautiful
Project Management strategy and tactical execution directly contributed to the launch failure of healthcare.gov. First and foremost, the Scope should have been limited to a modest beta upon launch. The Minimum Viable Product (MVP) could have been a simpler site that allowed users to compare different health plans and then purchase the insurance by going directly to the insurer’s website or using manual, offline processes that were reliable and secure. In fact, the healthsherpa.com site does just that. Its alpha version was built by three people, and it was delivered to production in a month. The site did not require user registration, used only zip code, age, and income to filter choices, and was initially based upon a spreadsheet export from healthcare.gov containing data on the health insurance plans available in the different states. Healthsherpa.com was a Y Combinator startup, backed by Kapor Capital, and founded by George Kalogeropoulos, Ning Liang, and Michael Wasser; this site has helped more than a million people sign up for health insurance through the ACA and continues to be an active, useful service today. Another issue affecting the project’s Scope and Schedule was the delay in the publication of regulations that would guide system development. As a result, some tasks could not start and others had to be redone. Some of the regulatory delay could be attributed to the gridlocked Politics after the midterm elections of 2010 and the administration’s desire to postpone controversial rulings until after the presidential elections of 2012. For example, HHS issued the final rules on private insurance as late as February 2013; these rules covered insurance premiums, coverage availability, risk pools, and catastrophic plans. These business rules had to be translated into system requirements and then developed into software.
Furthermore, HHS also delayed the date by which states had to commit to either operating a state exchange, partnering with the federal government, or letting the federal government run the exchange. Initially, states had to decide by November 2012, but HHS delayed the deadline until December 2012, with some states slipping to February 2013. Postponing these decisions reduced the time that CMS and states had to connect their systems together. Politics and the overconfidence bias also surfaced in August 2013 when the White House and executive Office of Health Reform insisted on requiring site user registration before shopping for insurance so that concrete user numbers could be shown as proof of the system’s success. According to a Senate audit report, this goal line audible and project change was not communicated to QSSI, and this created further discrepancy between the actual and expected load on the EIDM. Interestingly, the existing health insurance exchanges of Kentucky and Massachusetts that were inspirations for healthcare.gov both allowed users to browse and “window shop” anonymously, much like healthsherpa.com. The bottom line is that with just over a month before the release deadline, Scope was still expanding on the healthcare.gov project when the focus should have been on mitigating the myriad risks, shrinking scope, testing the system end-to-end with expected load, revisiting hardware capacity plans, conducting security audits, documenting the release checklist, and planning operational support. Experienced project managers are familiar with the classic triangle of project constraints: Scope, Cost, and Quality. It is a challenge to satisfy all three; the usual balancing act, especially at the end of a project, involves one of the three giving way to the other two. Since Scope was changing and the Time (Cost) deadline was fixed, one could expect that the Quality would decrease, and this is exactly what happened.
The GAO report describes that in September 2013 (less than one month before go-live) CMS had identified 45 critical and 324 serious code defects across FFM modules. Second, the project should have adopted a true Agile System Development philosophy and a lean software manufacturing process such as Scrum or Kanban in which the sprints are driven by the top priorities, include full end-to-end testing of the technology solution, and produce real, deliverable artifacts that users can get business value from. According to media reports, some of the component teams refined the UX through wireframes, worked in sprints, had a war room with story cards on the walls, and published code on GitHub, but these practices were a glossy veneer around a Waterfall process that stacked component construction in parallel and delayed integration testing until the end. Third, the project steering committee should have elected one project Executive with the authority to make final decisions after consultation with key stakeholders. The diffuse distribution of authority, management by consensus, and lack of accountability on the project led to a tragedy of the commons in which everyone had a piece of this or that, but no one was in charge. Of course, a leader can only make good decisions if they are well informed, so coupled with authority, they needed transparency and status visibility for the critical path components that different groups were working on, as well as a unified release calendar for all the critical path milestones.
One might reasonably wonder whether, with all these problems, anyone was sounding the alarm to change course before the release deadline: delay the project, strengthen the leadership role, reduce and freeze the scope of v1.0, test the system A-to-Z, and ensure that sufficient resources were available to launch the system successfully and operate it going forward. While system development was ongoing, CMS did obtain independent audits from several private sector firms including McKinsey, Mitre, and Turning Point. The McKinsey report was based on analysis of project documents, interviews with project officials, and participation in meetings to assess and influence the “facts on the ground”. Released in April 2013, the McKinsey report highlighted the initiative’s Complexity and identified more than a dozen critical risks spanning the marketplace technology and project governance. In the latter category, the report mentioned the waterfall SDLC process, uncertainty of v1.0 requirements, multiple definitions of success, heavy dependence on 3rd party contractors, parallel stacking of phases, an inadequate integration testing window, the Big Bang launch volume, the matrix management, the absence of clear leadership roles and responsibilities, and the lack of an end-to-end operational view of critical path interdependencies across agencies or within an agency. The report also described several options to mitigate these risks, but the project steering committee did not act upon the McKinsey report’s findings and recommendations before system launch.
In late October 2013, HHS announced several changes to the healthcare.gov project. First, project management was centralized and led by Jeffrey Zients, former OMB director. Zients had been a successful management consultant and CEO in the corporate world; he also had a rockstar reputation within the White House for solving tough problems and managing successful teams. Second, Todd Park, White House CTO, reorganized the technology leadership team, demoted some of the underperforming CMS employees and 3rd party contractors, and recruited top talent from Silicon Valley for a government sabbatical to save the site. Third, a Tiger team was formed with the narrow mandate of getting Production working properly; the team scrummed daily, triaged existential risks, and prioritized the most important defects on the critical path. Over the next six weeks, the team fixed around 400 system defects, increased system concurrency to 25,000 users, and improved website page responsiveness to one second. Enrollment jumped from 26,000 in October to 975,000 in December. The tech surge worked so well that many of the Silicon Valley fellows stayed on to establish the successful US Digital Service organization in 2014 to transform important, public-facing digital services provided by the government. If the careful reader is willing to put aside partisanship and look beyond the politics and economics of the ACA, there are many lessons that one can learn from the failed launch and recovery of the healthcare.gov system.
References