March 7, 2012
Departments were heavily impacted by last month's problems with the L&S web hosting service, coming as they did during the first week of classes. We apologize for the difficulties this outage caused your faculty, staff and students.
Now that our web sites are more or less back to normal, we are looking at our hosting infrastructure and procedures to determine if there are measures we should have taken in the past, or can take in the future, to mitigate the risk of downtime for our sites. We will make every effort to reduce the likelihood of a future outage, and to hasten the recovery from any such outage.
We'd also like to answer some of the questions you may have about the outage and about how we will reduce our risk going forward.
Some folks have asked how we came to choose Dreamhost as a hosting vendor. Dreamhost has a number of attractive features: they are a California-based company, running their own data centers in California (which is important for legal and privacy reasons); they have probably the best custom web hosting control panel environment in the industry; and they have a wide range of hosting options, from low-cost shared hosting plans up through dedicated hosting plans designed for enterprises. The system is flexible and scalable; it's easy to move your hosting to a higher tier of service. Dreamhost also does not meter disk or bandwidth usage, which is important for our shared environment--individual departments can't raise the cost for everyone else by using too much.
For these reasons, Dreamhost is probably the single most-used vendor for web hosting on campus. In particular, IST chose Dreamhost to host the socrates.berkeley.edu web sites when they shut down socrates, which meant that the Terms Of Service (TOS) were already vetted and approved by the campus.
Since we originally chose Dreamhost, their TOS agreement has changed somewhat; their online support, which used to be free, is now an add-on option, and it was not as responsive during this outage as we wanted it to be. We will be assessing whether staying with Dreamhost as a vendor makes sense. As noted in "Migration" below, the cost of moving elsewhere would be considerable.
Some of this is conjecture, but to the best of our knowledge, the sequence of events was:
During the recovery process we noted several gaps in our recovery procedures; all of them are applicable whether we're hosting with Dreamhost, a different vendor, or running web hosting locally.
The outage tested our communication channels, and opened up several new ones.
Dreamhost claims to have a "100% Uptime Guarantee." Most hosting vendors offer uptime guarantees (typically 99.9%), but the only thing the guarantee assures is a refund of a portion of your monthly fee if the server is down too long. A small refund doesn't help us recoup the costs associated with a major server outage; the promise is empty.
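To put those percentages in perspective, here is a rough sketch of how much downtime a given uptime figure tolerates. The function name and the 30-day month are our own assumptions for illustration; they are not taken from Dreamhost's or any other vendor's terms.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime a period can contain while still meeting the uptime figure.

    Assumes a simple calendar-time calculation over `days` days; real vendor
    terms may measure and exclude downtime differently.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


for pct in (99.0, 99.9, 99.99, 100.0):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a typical 99.9% guarantee tolerates roughly 43 minutes of downtime in a 30-day month, so a multi-day outage exceeds it many times over, and the remedy is still only a partial refund.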
Hosting with IST, or hosting ourselves, would not guarantee uptime either. While downtime is going to happen for any system, the question of how outages are managed is important to address.
We are not happy with how Dreamhost managed this outage. The communication from the vendor was generally poor; there were 6 to 8 hours between most status updates, and the updates often lacked sufficient detail for us to respond to your questions.
We are in contact with Dreamhost to understand why the communication was inadequate, and how they plan to address this concern in the future; our assessment of whether or not we should stay with Dreamhost will largely depend on their response to this question.
Some departments are wondering if they should move their web hosting elsewhere. My assumption is that we will be better off in negotiations with Dreamhost if we approach them as a single large entity rather than 100 small entities; it would be better to collaborate on this question than to fragment our response.
In a separate incident which coincidentally occurred around the same time as our outage, Dreamhost discovered a security breach which required them to reset all SFTP and shell passwords. This meant that web managers needed to reset the passwords they use to update files on their sites; most were able to do this through Dreamhost's web panel interface. See our story on the Dreamhost password update for more info.
Migration is expensive and time-consuming. If we had to do an emergency migration--if the server were down with no possibility of recovery--at minimum it would take 24 hours to point all the host names to a new server and bring everything up. It would then take many days to resolve issues with dependencies and custom configurations for our 100+ sites, and to create and communicate user accounts and passwords for all those sites. Everyone would be down for a day, and some sites would be down for multiple days.
In a planned migration, we could minimize downtime, limiting it to zero for most sites and a few hours for others. But the planning would take time and resources: configuration, testing, and communication. An orderly exit from Dreamhost would take at least a month, probably more, with multiple folks working on the project.
So, while we are not happy with how the downtime was handled, we will talk with the vendor to see if they are willing to address issues around support channels and procedures, and will be investigating whether other alternatives are worth considering. It is not a decision we will take lightly.
L&S runs its own web hosting because IST has not traditionally offered a service which fits the needs of our departments. There are a number of IST services which could be used for web hosting, but none provides a robust web hosting environment; to use them, we would have to build and maintain the kind of web environment which is provided as part of the cost by Dreamhost and its competitors. IST's web hosting services are also expensive relative to commodity vendors, and their support hours and uptime guarantees do not compare favorably with commodity vendors.
IST recently announced a managed third-party Drupal service which we will be evaluating as an option for sites running on Drupal (we currently have about 15 of those). At a minimum of $25/month/site, the cost is more than we are currently spending for hosting, and at the higher tiers is quite a bit more.
We feel that web hosting service should be provided by the campus at little or no cost to departments. The IST Service Advisory Council recommended that IST dissolve its current web hosting offerings in favor of a vendor offering. We expect that this will be a future direction, but it is some time away from being available and ready for our departments.
We again apologize for the troubles this outage caused for your departments at such a critical time of the semester. Please trust that we made every effort to reduce the downtime and the impact of the incident, and will continue to make every effort to learn from the recent events, and evaluate additional improvements we can make to minimize the risk of a similar outage in the future. We understand how important your web sites are to the work of your departments.
Please contact Tom Holub (tom@LS.berkeley.edu) if you have any comments or questions.