FYI. Might see some network P/L today..

ROGER
Posts: 13
Joined: Wed Nov 23, 2022 11:02 pm

Re: FYI. Might see some network P/L today..

Post by ROGER »

Omni test is back up!

pooty wrote: Sun Jan 14, 2024 1:49 pm From NFO
Re: Chicago issues
Jan 14 2024 11:48:10 AM PT The issues in Chicago are actually due to components in our routers overheating, due to a major datacenter outage with their cooling system. We are unable to stop the packet loss for now as a result. We are actively monitoring for updates from the facility on a fix.
Upstream link issue
YEAAAHHHHHHHHHH
Posts: 1120
Joined: Tue Jun 08, 2021 3:03 pm
Server Sponsor: Yes
Server Admin: Yes

Re: FYI. Might see some network P/L today..

Post by YEAAAHHHHHHHHHH »

pooty wrote: Sun Jan 14, 2024 1:49 pm From NFO
Re: Chicago issues
Jan 14 2024 11:48:10 AM PT The issues in Chicago are actually due to components in our routers overheating, due to a major datacenter outage with their cooling system. We are unable to stop the packet loss for now as a result. We are actively monitoring for updates from the facility on a fix.
Upstream link issue
Image
pooty
Posts: 4536
Joined: Sat Apr 03, 2021 10:22 am
Location: Michigan
Server Sponsor: Yes
Server Admin: Yes

Re: FYI. Might see some network P/L today..

Post by pooty »

Update @ 9pm CST: We are told that one of the broken chillers is back online now and that temperatures have stabilized at 120 degrees F. It is still not possible for us to effectively troubleshoot downed equipment, but we are monitoring very closely and will try to bring everything back online as soon as we are able to.

Update @ 9:41pm CST: The facility says that another chiller is back online and that temperatures are slowly decreasing now, but we have not seen a change yet in our equipment status. We are continuing to monitor and wait. 
Flounder
Posts: 30
Joined: Tue Apr 06, 2021 6:04 pm
Location: SE Michigan
Server Sponsor: Yes
Server Admin: Yes

Re: FYI. Might see some network P/L today..

Post by Flounder »

Will NFO give us access to their security cameras to verify when leon made an appearance?
captainsnarf
Posts: 2714
Joined: Tue Jul 06, 2021 1:51 pm
Location: Washington
Server Sponsor: Yes
Server Admin: Yes

Re: FYI. Might see some network P/L today..

Post by captainsnarf »

looks like it's back now. HOC also
ankeedo
Posts: 42
Joined: Sat Jul 03, 2021 5:33 pm
Location: IRAQ

Re: FYI. Might see some network P/L today..

Post by ankeedo »

captainsnarf wrote: Sun Jan 14, 2024 10:48 pm looks like it's back now. HOC also
no, they didn't open the room windows yet :lol: :lol:
pooty
Posts: 4536
Joined: Sat Apr 03, 2021 10:22 am
Location: Michigan
Server Sponsor: Yes
Server Admin: Yes

Re: FYI. Might see some network P/L today..

Post by pooty »

Latest update

Update @ 2am CST: As the temperature slowly goes down, our router is going longer before its network adapter overheats and it kills the connection. We are observing about 7 minutes of connectivity before it goes offline for a minute.

Our primary router and one of our network switches are still offline. We have asked the facility to investigate these, but they have told us that they will not turn back on any equipment for customers until temps drop further. We will pursue them.

Update @ 3:42am CST: The ambient temperature dropped a little further and our secondary router has not had a high-temperature disconnect error for a bit over 30 minutes now. This means that most customers have connectivity again.

We still have one switch offline, and our primary router offline, and we are lobbying the facility to investigate these ASAP. The router being offline is not causing customers problems because it is redundant, but the switch being offline is leading to some customers' machines or VDSes being inaccessible.

So far, we have seen a couple of machines that rebooted due to the heat, but we haven't noted any total hardware failures apart from the switch and primary router. We will be performing a complete audit of all equipment after temperatures are back in the normal range and the facility has restored the downed switch.

Update @ 3:57am CST: The switch that was offline is now online again; it seems to have left a temperature protection mode as the ambient temperature dropped. We are continuing to investigate the downed router and to look for any other equipment that might be having problems.

Please also note that because the facility's temperature is still high -- we are told that it is 88F now -- automatic CPU throttling may occur on machines at the physical system level, limiting performance. This should automatically resolve as the temperature drops further.

Update @ 9:08am CST: Customer equipment stayed online through the night, but the facility itself has not yet fully recovered, so we're not out of the woods yet. Equinix says that there was a slight increase in temperature during the night when two chillers failed and had to be restarted, and that they are in the process of installing additional portable coolers. They have not yet worked on our offline router. 
pooty
Posts: 4536
Joined: Sat Apr 03, 2021 10:22 am
Location: Michigan
Server Sponsor: Yes
Server Admin: Yes

Re: FYI. Might see some network P/L today..

Post by pooty »

Seems like it's all better now

Update @ 3:57am CST on 1/15: The switch that was offline is now online again; it seems to have left a temperature protection mode as the ambient temperature dropped. We are continuing to investigate the downed router and to look for any other equipment that might be having problems.

Please also note that because the facility's temperature is still high -- we are told that it is 88F now -- automatic CPU throttling may occur on machines at the physical system level, limiting performance. This should automatically resolve as the temperature drops further.

Update @ 9:08am CST on 1/15: Customer equipment stayed online through the night, but the facility itself has not yet fully recovered, so we're not out of the woods yet. Equinix says that there was a slight increase in temperature during the night when two chillers failed and had to be restarted, and that they are in the process of installing additional portable coolers. They have not yet worked on our offline router.

Update @ 3:40am CST on 1/16: The facility reports that five out of six chillers are operational and the datacenter is within a normal temperature range again. They manually rebooted our primary router, and it came back online; we've now shifted loads back onto it.

Our audits have not identified any equipment that is not functioning properly, so everything appears to be back to normal now at this facility. We will continue to monitor, however, and follow up with the facility as they work to repair the sixth chiller and improve their overall cooling systems. 