Albert

Explosion Takes Out AtariAge [Updated]


This coincided with a ban at work on YouTube, Facebook, etc., so I thought I had lost access at work permanently. It was good (but bad) when it didn't work at home either.


When the site was down, the first thing I thought was "better post this problem to see what happened!" I didn't think it was AtariAge, much less an explosion! I thought it was my crappy computer running Vista just spazzing out on me...

When the site was down, the first thing I thought was "better post this problem to see what happened!" I didn't think it was AtariAge, much less an explosion! I thought it was my crappy computer running Vista just spazzing out on me...

After I soaked my computer in bleach, it has much less of a musty smell and much less visible mold. It's not going to explode any time soon. :D


There was a sweet power outage at 365 Main last year. It took out Craigslist, Technorati, TypePad, MySpace, etc. 365 Main has N+2 redundancy that is so beautiful it makes engineers cry when we read about it. There are no batteries: utility power drives a Hitec UPS with a constantly spinning two-ton flywheel, which powers a generator that powers the building. In the event of a power failure, the diesel engines kick on and keep the flywheel spinning.

 

The extreme techs out there may greatly enjoy this lengthy discussion of the problem and investigation.

 

 

Hello,

 

At approximately 13:53 PDT, our network monitoring tools indicated an issue with our router out of San Francisco in the 365 Main facility. If you are receiving this email, then one or more of your circuits is homed to this particular device in our NAP. At this time, we are seeing the line protocol of the circuits in down states.

We are investigating this issue and will provide updates as they become available.

 

------------------

 

365 Main has advised that they intend to stay on generator power for the majority of the night. They are working to ensure that utility power is 100% stable before transferring the load.

We will be working with 365 Main to determine exactly what systems failed and why power was not transferred to UPS/generator as expected. An official explanation will be forthcoming.

 

--------------------

 

> At 1:49 p.m. on Tuesday, July 24, 365 Main's San Francisco data center was affected by a power surge caused when a PG&E transformer failed in a manhole under 560 Mission St.
>
> An initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building. On-site facility engineers responded and manually started affected generators, allowing stable power to be restored at approximately 2:34 p.m. across the entire facility.
>
> As a result of the incident, continuous power was interrupted for up to 45 minutes for certain customers. We're certain colo rooms 1, 3 and 4 were directly affected, though other colocation rooms are still being investigated. We are currently working with Hitec, Valley Power Systems, Cupertino Electric and PG&E to further investigate the incident and determine the root cause.
>
> Utility power has been restored but all generators will continue to operate on diesel until the root cause of the event has been identified and corrected. Generators are currently fueled with over 4 days of fuel and additional fuel has already been ordered.

 

--------------------- (here's where it gets good)

 

> UPDATE: 5:00 P.M., Wednesday, July 25, 2007
>
> A complete investigation of the power incident continues with several specialists and 365 Main employees working around the clock to address the incident.
>
> Generator/Electrical Design Overview
>
> The San Francisco facility has ten 2.1 MW back-up generators to be used in the event of a loss of utility. The electrical design is N+2, meaning 8 primary generators can successfully power the building (labeled 1-8), with 2 generators available on stand-by (labeled Back-up 1 and Back-up 2) in case there are any failures with the primary 8.
>
> Each primary generator backs up a corresponding colocation room, with generator 1 backing up colocation room 1, generator 2 backing up colocation room 2, and so on.
>
> Series of Electrical Events
>
> * The following is a description of the electrical events that took place in the San Francisco facility following the power surge on July 24, 2007:
>
> o When the initial surge was detected at 1:47 p.m., the building's electrical system attempted to roll all colocation rooms to diesel generator power.
>
> o Generator 1 detected a problem in its start sequence and shut itself down within 8-10 seconds. The cause of the start-up failure is still under investigation, though engineers have narrowed the list of suspected components to 2-3 items. We are testing each of these suspected components to determine if service or replacement is the best option. Generator 1 was started manually by on-site engineers and reestablished stable diesel power by 2:24 p.m.
>
> o After the initial failure, Generator 1 attempted to pass its 732 kW load to Back-up 1, which also detected a problem in its start sequence. The exact cause of the Back-up 1 start sequence failure is also under investigation.
>
> o After Generator 1 and Back-up 1 failed to carry the 732 kW, the load was transferred to Back-up 2, which correctly accepted the load as designed.
>
> o Generator 3 started up and ran for 30 seconds before it too detected a problem in the start sequence and passed an additional 780 kW to Back-up 2 as designed.
>
> o Generator 4 started up and ran for 2 seconds before detecting a problem in the start sequence, passing its 900 kW load on to Back-up 2. This 900 kW brought the total load on Back-up 2 to over 2.4 MW, ultimately overloading the 2.1 MW Back-up 2 unit and causing it to fail. Generator 4 was manually started and brought back into operations at 2:22 p.m. Generator 4 was switched to utility operations at 7:05 a.m. on 7/25 to address an exhaust leak but is operational and available in the event of another outage.
>
> o Generators 2, 5, 6, 7 and 8 all operated as designed and carried their respective loads appropriately.
>
> o By 1:30 p.m. on Wednesday, July 25, after assurance from PG&E officials that utility power had been stable for at least 18+ continuous hours, 365 Main placed diesel engines back in standby and switched generators 2, 5, 6, 7 and 8 to utility power.
>
> * Customers in colocation rooms 2, 4, 5, 6, 7 & 8 are once again powered by utility, and are backed up in an N+1 configuration with the Back-up 2 generator available.
>
> * Generators that had failed during the start-up sequence but were performing normally after manual start (1 & 3) continue to operate on diesel and will not be switched back to utility until the root causes of their respective failures are corrected.
>
> Other Discoveries
>
> * In addition to previously known affected colocation rooms 1, 3 and 4, we have discovered that several customers in colo room 7 were affected by a 490 millisecond outage caused when the dual power input PDUs in colo 7 experienced open circuits on both sources. A dedicated team of engineers is currently investigating the PDU issue.
>
> Next Steps
>
> * Determine the exact cause of the generator start-up failures and PDU issues through a comprehensive testing methodology.
>
> * Replacements for all suspected components have been ordered and are en route.
>
> * Continue to run generators 1 & 3 on diesel power until the automatic start-up failure root cause is corrected.
>
> * Continue to update customers with details of the ongoing investigation.

 

------------------------

 

========================================

UPDATE: 4:30 P.M., Sunday, July 29, 2007

========================================

 

 

SUMMARY

 

. Comprehensive testing of speed control and regulation components and their relationship to the startup sequences of the generators continues.

. Unit 1 continues to be the focus of initial testing. During over 100 start/stop tests late Saturday night on Unit 1, the investigation team was able to simulate failure. The digital controller for the diesel engine (known as a DDEC) has proven erratic and a spare DDEC is en route. While this component is the focus of the investigation, the team continues start/stop testing to rule out other potential contributors to failure.

CURRENT GENERATOR STATUS

. Operational status is unchanged. The overall facility continues to operate at N+1 power system redundancy. Unit 1 remains offline, pending some additional testing. Unit 3 continues to support customer load in Diesel mode. All other units are supporting customer load in Normal mode.

OTHER DISCOVERIES

. None

NEXT STEPS

. Determine the exact cause of the diesel engine synchronization failure and the PDU issue.

. Continue to run generator 3 on diesel power until the diesel engine synchronization failure root cause is corrected.

. Continue to update customers with details of the ongoing investigation. Reports will be posted each day at 4:30 p.m. until root cause is determined.

 

------------------------------------

 

 

> UPDATE: 4:30 P.M., Monday, July 30, 2007
>
> SUMMARY
>
> # The investigation team further pinpointed the digital controller of the generator unit as the probable root cause of failure in Unit 1. After altering the suspect DDEC component, the investigation team was able to successfully start Unit 1 over 60 times without incident.
>
> # Hitec technicians performed a series of tests that confirmed the timing sequence for the DDEC controller was the probable cause of the 7/24 failure-to-start event.
>
> # Having corrected the root cause on Unit 1, the team successfully returned Unit 1 to utility without incident and turned its testing focus to Unit 3. Unit 3 was first powered down for inspection of the clutch oil and proactive replacement of the clutch oil feed grommet (the unit had been running in diesel mode for 6 days). When the inspection and repairs were completed, the team was able to fail the unit and observed the same error in the DDEC component. Technicians have implemented the DDEC fix on Unit 3 and are in the process of verifying this was the root cause of the start sequence failure on Unit 3.
>
> CURRENT GENERATOR STATUS
>
> # Operational status has changed since the last update. Unit 1 has finished testing and repairs and has been returned to Normal operation supporting customer load. Customer loads have been transferred from Unit 3, which had been operating in Diesel mode since Tuesday. All other units continue to support customer loads in Normal mode. The overall power system redundancy remains at N+1.
>
> NEXT STEPS
>
> # Complete root cause analysis and implement fixes on all affected generators. Return the building to normal operations once testing is complete and stability is proven.
>
> # Publish complete details of the investigation for 365 Main customers.

 

-----------------------

 

 

 

UPDATE: 4:30 P.M., Tuesday, July 31, 2007

 

SUMMARY

 

* Generator investigation

 

o DDEC was confirmed as root cause on each affected generator. All units have been fixed and returned to normal operation.

 

* PDU investigation in colo 7

 

o The 490 ms outage in colo 7 occurred at the PDU, not at the generator/flywheel UPS (Hitec).

 

o The PDU has 2 sources of power: primary (Source 1) and back-up (Source 2). During the power event there was a surge of over 11% on the primary (Source 1), more than the allowable setting of the PDU device. The PDU tried to switch to the redundant (Source 2) power supply. That power supply was trying to accept the load from three other colo rooms, reached an overload condition, and did not allow the transfer. The PDU then switched back to the primary (Source 1) supply. During this time 490 ms passed before the loads on the colo 7 PDUs were put on the unit 7 generator/flywheel UPS (Hitec).

 

o To correct this issue, we have set the over-voltage/under-voltage parameters to (+/-) 20% for all units in the building. PDUs performed normally following the change. 365 Main is implementing this change on all PDUs in all facilities.

 

CURRENT GENERATOR STATUS

 

* Following thorough testing and the successful implementation of the DDEC fix across all units, all generators are currently operating normally. The overall power system redundancy has returned to N+2.

 

--------------------------------------


Really interesting read, which again proves that NOTHING is fail-safe.

 

So what does N+2 mean? If N = 2, then we have complete redundancy, nice. But if N = 1000? Then +2 means almost nothing. So all this N+x stuff is just marketing BS, IMO.

 

There has to be some better measure.

So what does N+2 mean? If N = 2, then we have complete redundancy, nice. But if N = 1000? Then +2 means almost nothing. So all this N+x stuff is just marketing BS, IMO.

N+2 means that if N units are required to power the facility, the facility will keep an extra 2 units available in case of failure. If N=2, then the facility will have 4 units on site. If N=1000, then the facility will have 1002 units on site. For the most part this works regardless of scale because the units are expected to be reliable. N+1 should be all that's needed to recover from a failure. N+2 offers the ability to continue running after a failure AND still have backups available in case there is another failure before the previously failed unit is repaired.
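The scaling question can be made concrete with a quick simulation (my own sketch in Python, nothing to do with 365 Main's actual tooling; the 1% per-unit failure rate is an assumed figure for illustration): with N required units and k spares, the facility stays up as long as no more than k units fail at once.

```python
import random

def outage_probability(n_required, spares, p_fail, trials=100_000, seed=42):
    """Estimate the chance that more units fail than there are spares,
    assuming each of the n_required + spares units fails independently
    with probability p_fail over the period of interest."""
    rng = random.Random(seed)
    total_units = n_required + spares
    outages = 0
    for _ in range(trials):
        failures = sum(rng.random() < p_fail for _ in range(total_units))
        if failures > spares:  # more failures than spares: capacity shortfall
            outages += 1
    return outages / trials

# N+2 at the scale 365 Main describes (8 primary generators):
print(outage_probability(8, 2, p_fail=0.01))     # rare: needs 3 of 10 to fail
# The same +2 spares at N=1000, as in the objection above:
print(outage_probability(1000, 2, p_fail=0.01))  # near-certain shortfall
```

At a 1% failure rate, about ten failures are expected among 1,000 units, so two spares are hopeless at that scale while covering N=8 comfortably. Note the simulation assumes independent failures, which is exactly what the DDEC incident violated: a common-mode flaw can take out several units at once regardless of spare count.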

 

In this case, 8 units were required to power the facility, so the facility had 10 units on site. Unfortunately, there appear to have been 4 separate failures in the generators, resulting in the facility being underpowered. The majority of these failures appear to trace back to a problem with the DDEC controller units, possibly a design flaw or a software bug. Either way, the automatic startup of the generators was not functioning properly due to faulty components provided by the manufacturer.

 

This is every support person's worst nightmare. All the planning and configuration is worth nothing if the parts you purchase end up being unreliable. Any flaw in those parts is likely to be present in all copies of those parts. This is why fields where failure results in death tend to have backup systems developed by an independent contractor. Even if a fault develops in the primary system, the backup is unlikely to repeat the same flaw.

For the most part this works regardless of scale because the units are expected to be reliable. N+1 should be all that's needed to recover from a failure. N+2 offers the ability to continue running after a failure AND still have backups available in case there is another failure before the previously failed unit is repaired.

Depends on the reliability you assume. Even with 99% reliability, 2 reserve systems are too few for N=1000.

 

This is every support person's worst nightmare. All the planning and configuration is worth nothing if the parts you purchase end up being unreliable. Any flaw in those parts is likely to be present in all copies of those parts.

Right. But those should become obvious in a test. It seems they never tested their theoretically "great" backup system.

This is every support person's worst nightmare. All the planning and configuration is worth nothing if the parts you purchase end up being unreliable. Any flaw in those parts is likely to be present in all copies of those parts.
Right. But those should become obvious in a test. It seems they never tested their theoretically "great" backup system.

In theory, everything works in practice. ;)

 

Doing a test on a running facility is prohibitively expensive and difficult to do. So such tests are usually simulated. The unfortunate fact is that test situations don't always match up to the reality. Sometimes the problems are stupid ones that can't be avoided. Allow me to give a hypothetical example.

 

Let's say your equipment is specced out with a certain range of expected phase variances on startup. Let's also say that those phase variances are within what's considered the industry norm. Ok, good. Your equipment should do the job just fine. But let's also say that this equipment is new high-end stuff that produces far more power than traditional generators. (Multi-megawatt generators are relatively new to data centers. Their high demand has been a limiting factor in building out new centers.) Except, let's say, the amount of power produced creates a greater phase variance than expected for a longer period than expected. The equipment may still be capable of being run until it is brought into phase, but the automated systems will have no way of knowing this. Thus the controller detects the larger-than-expected variance, notes the period for which it has continued, then decides that there is a problem with the power grid and shuts down to prevent equipment damage.

 

In a situation like that, all the specs were met and the equipment was probably tested for the expected variances, but no one realized such a situation could occur. A problem like that is difficult to detect in testing.
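A protection controller with that kind of trip logic might be sketched like this (purely illustrative Python; real engine controllers such as the DDEC are embedded firmware, and both thresholds here are invented numbers):

```python
# Assumed thresholds; a real controller's limits come from the generator
# and grid specifications, not from these invented numbers.
TOLERANCE_DEG = 10.0      # allowable phase deviation, degrees
MAX_OUT_OF_SPEC_S = 2.0   # time allowed out of spec before tripping

def should_trip(deviations, sample_period_s):
    """Return True if the phase deviation stays outside tolerance for
    longer than the allowed window; deviations are sampled every
    sample_period_s seconds."""
    out_of_spec_s = 0.0
    for deviation in deviations:
        if abs(deviation) > TOLERANCE_DEG:
            out_of_spec_s += sample_period_s
            if out_of_spec_s > MAX_OUT_OF_SPEC_S:
                return True   # assume a grid fault; shut down to protect the unit
        else:
            out_of_spec_s = 0.0  # back in spec; reset the timer
    return False

# A healthy but slow-to-synchronize unit: 2.5 s out of spec at 0.1 s/sample
# trips the protection even though nothing is actually wrong with it.
print(should_trip([15.0] * 25 + [2.0] * 10, 0.1))  # True
```

The point of the sketch is that the trip decision depends entirely on the assumed window: a bigger machine that legitimately takes longer to come into phase looks identical, to this logic, to a genuine grid fault.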

 

I see similar issues in the software development I do. You code something that tests fine under all the test cases you can come up with. Then the users come along and start doing things that you never thought they'd try to do. Suddenly, you have an emergency on your hands as you scramble to update the system to handle this edge case that was not accounted for. (This is why I always like to find the most bumbling QA person I can find. The more whacked out, unexpected, unpredictable situations they can find, the better! ;))
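Randomized (fuzz) testing automates exactly that kind of QA person: it throws inputs the author never anticipated at the code. A toy Python sketch, where both the function under test and the mangling alphabet are invented for illustration:

```python
import random

def parse_quantity(text):
    """Toy function under test: parse strings like '3' or '  12 '."""
    return int(text.strip())

def fuzz(fn, trials=1000, seed=0):
    """Feed randomly mangled inputs to fn and collect the surprises."""
    rng = random.Random(seed)
    alphabet = "0123456789 -+.x"
    failures = []
    for _ in range(trials):
        candidate = "".join(rng.choice(alphabet)
                            for _ in range(rng.randint(0, 6)))
        try:
            fn(candidate)
        except ValueError:
            failures.append(candidate)  # inputs the happy path never handled
    return failures

bad_inputs = fuzz(parse_quantity)
print(len(bad_inputs) > 0)  # True: the fuzzer finds unhandled inputs quickly
```

Even this crude version immediately turns up empty strings, stray '.' and 'x' characters, and bare signs, none of which a quick manual test pass would necessarily try.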

Depends on the reliability you assume. Even with 99% reliability, 2 reserve systems are too few for N=1000.

 

Depends what you mean by "99% reliability"? Over what time period?

 

Units which are 99% reliable over the course of a year will be about 99.98% reliable over the course of a week. If there are 1,000 such units, there will be a 17.57% chance of one failing in a week, a 3.09% chance of two failing in a week, or a 0.54% chance of three failing. If repairs would be expected to take a week, that's obviously not acceptable, but suppose repairs were expected to take only half a week. In that case, the units would be 99.990% reliable over the course of a half-week, so with 1,000 units there would be a 9.56% chance of one failure in a half-week, a 0.91% chance of two failures, or a 0.087% chance of three failures. Still not good enough, but one wouldn't have to add many more spares to make it so. Even in a field of 1,000 not-super-reliable units, having four backups would reduce the probability of a critical failure to less than 0.001%/half-week, or 0.08%/year.
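Figures like these are easy to sanity-check with a few lines (my own sketch; the exact percentages depend on whether "one failing" means exactly one or at least one, and on the assumption that failures are independent):

```python
import math

def prob_at_least(k, n_units, p_unit):
    """P(at least k of n_units fail), each unit failing independently
    with probability p_unit, from the binomial distribution."""
    p_fewer = sum(math.comb(n_units, i) * p_unit**i * (1 - p_unit)**(n_units - i)
                  for i in range(k))
    return 1 - p_fewer

# 99% reliability over a year -> per-week failure probability:
p_week = 1 - 0.99 ** (1 / 52)
print(p_week)                          # roughly 0.0002 per unit per week
print(prob_at_least(1, 1000, p_week))  # roughly 0.18: some failure most weeks
print(prob_at_least(3, 1000, p_week))  # three or more failures is much rarer
```

The "at least one failure in a week" number lands close to the 17.57% quoted above; the common-mode caveat from earlier still applies, since a shared component flaw makes the independence assumption optimistic.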


Someone asked if the store is down for good or not. Sorry if I missed the response, but I was wondering the same myself. Is the AA store coming back soon?


I got to see The Planet's offices in Downtown Houston about a week ago. My friend's fiancée has been painting murals in their conference rooms based on the name of the room. She has a couple of them on her site - The Tank and The Hill. The Hill is a view of the location the offices are at, before the building existed - the 1930s, if I recall correctly. The Esperson Building can be seen in the background, it was built in 1927.


Good to see that AtariAge is still here! Also good to know that everyone and everything is ok. I had my Lynx to keep me company during your absence...but I did miss the forum chats on here. Good to have you back!

I think those guys had a meth lab going down there, hence the explosion.

 

 

Anyhow I am really glad everything is all right, and now I know what jail feels like.

 

heh. that's not all that uncommon here in Houston... stupid meth junkies

Welcome back Atariage! You had me worried last night.

Uhhr, what happened last night? The site was not down..

 

..Al

 

 

I was not able to get on for hours. (And I was at home and able to access other sites fine.)

Edited by doctorclu

I was not able to get on for hours. (And I was at home and able to access other sites fine.)

Probably a network issue somewhere. The site was fine, and you can take a look at the "Today's Active Topics" link on the front page to see that threads have been active during the last 24 hours without a large gap of time.

 

..Al

