FYI. Might see some network P/L today..
Re: FYI. Might see some network P/L today..
Omni test is back up!
- YEAAAHHHHHHHHHH
- Posts: 1120
- Joined: Tue Jun 08, 2021 3:03 pm
- Server Sponsor: Yes
- Server Admin: Yes
Re: FYI. Might see some network P/L today..
pooty wrote: ↑Sun Jan 14, 2024 1:49 pm
From NFO
Re: Chicago issues
Jan 14 2024 11:48:10 AM PT The issues in Chicago are actually due to components in our routers overheating, due to a major datacenter outage with their cooling system. We are unable to stop the packet loss for now as a result. We are actively monitoring for updates from the facility on a fix.
Upstream link issue
- pooty
- Posts: 4535
- Joined: Sat Apr 03, 2021 10:22 am
- Location: Michigan
- Server Sponsor: Yes
- Server Admin: Yes
Re: FYI. Might see some network P/L today..
Update @ 9pm CST: We are told that one of the broken chillers is back online now and that temperatures have stabilized at 120 degrees F. It is still not possible for us to effectively troubleshoot downed equipment, but we are monitoring very closely and will try to bring everything back online as soon as we are able to.
Update @ 9:41pm CST: The facility says that another chiller is back online and that temperatures are slowly decreasing now, but we have not seen a change yet in our equipment status. We are continuing to monitor and wait.
- Flounder
- Posts: 30
- Joined: Tue Apr 06, 2021 6:04 pm
- Location: SE Michigan
- Server Sponsor: Yes
- Server Admin: Yes
Re: FYI. Might see some network P/L today..
Will NFO give us access to their security cameras to verify when leon made an appearance?
- captainsnarf
- Posts: 2713
- Joined: Tue Jul 06, 2021 1:51 pm
- Location: Washington
- Server Sponsor: Yes
- Server Admin: Yes
Re: FYI. Might see some network P/L today..
looks like it's back now. HOC also
Re: FYI. Might see some network P/L today..
no, they didn't open the room windows yet
- pooty
- Posts: 4535
- Joined: Sat Apr 03, 2021 10:22 am
- Location: Michigan
- Server Sponsor: Yes
- Server Admin: Yes
Re: FYI. Might see some network P/L today..
Latest update
Update @ 2am CST: As the temperature slowly goes down, our router is going longer before its network adapter overheats and it kills the connection. We are observing about 7 minutes of connectivity before it goes offline for a minute.
Our primary router and one of our network switches are still offline. We have asked the facility to investigate these, but they have told us that they will not turn back on any equipment for customers until temps drop further. We will pursue them.
Update @ 3:42am CST: The ambient temperature dropped a little further and our secondary router has not had a high-temperature disconnect error for a bit over 30 minutes now. This means that most customers have connectivity again.
We still have one switch offline, and our primary router offline, and we are lobbying the facility to investigate these ASAP. The router being offline is not causing customers problems because it is redundant, but the switch being offline is leading to some customers' machines or VDSes being inaccessible.
So far, we have seen a couple of machines that rebooted due to the heat, but we haven't noted any total hardware failures apart from the switch and primary router. We will be performing a complete audit of all equipment after temperatures are back in the normal range and the facility has restored the downed switch.
Update @ 3:57am CST: The switch that was offline is now online again; it seems to have left a temperature protection mode as the ambient temperature dropped. We are continuing to investigate the downed router and to look for any other equipment that might be having problems.
Please also note that because the facility's temperature is still high -- we are told that it is 88F now -- automatic CPU throttling may occur on machines at the physical system level, limiting performance. This should automatically resolve as the temperature drops further.
Update @ 9:08am CST: Customer equipment stayed online through the night, but the facility itself has not yet fully recovered, so we're not out of the woods yet. Equinix says that there was a slight increase in temperature during the night when two chillers failed and had to be restarted, and that they are in the process of installing additional portable coolers. They have not yet worked on our offline router.
- pooty
- Posts: 4535
- Joined: Sat Apr 03, 2021 10:22 am
- Location: Michigan
- Server Sponsor: Yes
- Server Admin: Yes
Re: FYI. Might see some network P/L today..
Seems like it's all better now
Update @ 3:57am CST on 1/15: The switch that was offline is now online again; it seems to have left a temperature protection mode as the ambient temperature dropped. We are continuing to investigate the downed router and to look for any other equipment that might be having problems.
Please also note that because the facility's temperature is still high -- we are told that it is 88F now -- automatic CPU throttling may occur on machines at the physical system level, limiting performance. This should automatically resolve as the temperature drops further.
Update @ 9:08am CST on 1/15: Customer equipment stayed online through the night, but the facility itself has not yet fully recovered, so we're not out of the woods yet. Equinix says that there was a slight increase in temperature during the night when two chillers failed and had to be restarted, and that they are in the process of installing additional portable coolers. They have not yet worked on our offline router.
Update @ 3:40am CST on 1/16: The facility reports that five out of six chillers are operational and the datacenter is within a normal temperature range again. They manually rebooted our primary router, and it came back online; we've now shifted loads back onto it.
Our audits have not identified any equipment that is not functioning properly, so everything appears to be back to normal now at this facility. We will continue to monitor, however, and follow up with the facility as they work to repair the sixth chiller and improve their overall cooling systems.