Day In the Life: Network Operations Manager

By Mal Fitzgerald, Sales Engineer

 

It’s early. I’m crushing only my second large coffee of the day and the four words I never want to hear are being uttered by folks sitting in front of my office: “The network is slow.”

History tells me that I won’t get much more in the way of details. I am going to have to put on my Sherlock cap and track this down before it becomes a bigger issue, whether I am fully caffeinated or not.

In my head, I begin to run through the list of usual suspects of network slowness:

  • Is it in the datacenter?
  • Is it one of our remote offices?
  • Is it Jim, the head of finance, who is working from his private island estate?
  • Could it be part of our cloud infrastructure?

Five years ago, I would have turned to my packet captures to try to save the day. The network was static and unencrypted, and I was able to put test access ports (TAPs) in the most strategic places. But that was then and this is now. Application and service delivery mechanisms have changed drastically, and my clients need access to apps and data from seemingly EVERYWHERE, forcing me to build out a more dynamic, encrypted, ephemeral, and diverse network, or what I like to call DEED networking. This new network design required a move to tools that give me a more complete view of the network.

Having been burned in the past by not having packets available when or where I needed them, or by having those packets only to find them encrypted and not very helpful to my investigations, I am not completely panicking when I hear “the network is slow.” As my network became “atomized” to accommodate my application delivery teams and a post-COVID, work-from-anywhere world, I found I had one commonality across all these dynamic network paths: flow, specifically NetFlow, sFlow and cloud flow logs.

Utilizing flow data from all corners of the physical and cloud network in the same SaaS-based portal allows me to view traffic from a single pane of glass. With the right tool, I can visualize my entire network, investigate issues as they arise, detect changes in network behavior from an operational and security standpoint, and integrate this platform into my entire technology stack to take automated actions when specific criteria are met.

With flexible dashboarding and natural language query filtering, I have my views set up so that I can look at all the networks on a single dash, and quickly navigate down to specific networks that may be the cause. I can be working on my third cup of coffee in no time!

My high-level, everything-in-one-spot dashboard shows me my data centers are running normally, as is my small office in Boulder. Another benefit of flow versus packets is the amount of lookback I get because my flows have been ingested into a SaaS portal. I can go back months instantaneously to spot changes in trends, and I can see my CFO hasn’t been working from his private island in a few weeks!

However, this is where flow really pays off. After going through all my hardware-based networks, I don’t have to play “swivel chair quarterback” and change to other tools to look at my cloud flows. With our hybrid network, multi-cloud provider design, this is a huge time saver when troubleshooting! I don’t have to manage and learn three to four additional tools for the exact same outcome.

WHOA. Just quickly looking at cloud flows, I can see something has changed in the past three hours. Using natural language queries, I should be able to filter this down quickly and get to the root cause. Unfortunately, cloud can be very IP-centric, so finding and applying context in dynamic and ephemeral cloud workloads is really painful with the wrong tools. However, I utilize tooling that has the ability to ingest context for those flows. This allows me to put a name, region, virtual private cloud (VPC), and unique tag to an IP address far more quickly than having to leave the tool and track down information within each cloud provider’s portal. Using my natural language filters along with this context, I’ve been able to spot a significant change in my network traffic, down to the region (US-East-1), individual VPC (our Production E-commerce VPC), and even the application owner (Dev Team Tampa).
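For readers who want to picture what "ingesting context" means, here is a minimal sketch of the idea in Python. It is not how any particular product does it. The inventory, the field names (`src`, `dst`, `bytes`), the server names, and the addresses are all made up for illustration. The point is only that once each IP maps to a name, region, VPC, and owner tag, the same flows can be grouped by whichever label matters.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudContext:
    """Labels attached to an IP address (hypothetical schema)."""
    name: str
    region: str
    vpc: str
    owner_tag: str


# Hypothetical inventory, e.g. exported from each cloud provider's API.
CONTEXT_BY_IP = {
    "10.0.1.10": CloudContext("web-fe-1", "us-east-1", "prod-ecommerce", "dev-team-tampa"),
    "10.0.1.11": CloudContext("web-fe-2", "us-east-1", "prod-ecommerce", "dev-team-tampa"),
    "10.8.0.5": CloudContext("batch-1", "us-west-2", "analytics", "data-team"),
}

# Made-up flow records; real NetFlow/sFlow/cloud flow logs carry more fields.
FLOWS = [
    {"src": "203.0.113.7", "dst": "10.0.1.10", "bytes": 1200},
    {"src": "203.0.113.8", "dst": "10.0.1.11", "bytes": 800},
    {"src": "198.51.100.2", "dst": "10.8.0.5", "bytes": 500},
    {"src": "203.0.113.9", "dst": "172.16.0.9", "bytes": 300},  # no context known
]


def enrich(flow, inventory):
    """Return a copy of the flow with the destination's context attached (or None)."""
    enriched = dict(flow)
    enriched["dst_context"] = inventory.get(flow["dst"])
    return enriched


def bytes_by(flows, inventory, key):
    """Sum bytes per context attribute (e.g. 'vpc', 'region', 'owner_tag')."""
    totals = defaultdict(int)
    for flow in flows:
        ctx = enrich(flow, inventory)["dst_context"]
        label = getattr(ctx, key) if ctx is not None else "unknown"
        totals[label] += flow["bytes"]
    return dict(totals)


VPC_TOTALS = bytes_by(FLOWS, CONTEXT_BY_IP, "vpc")
OWNER_TOTALS = bytes_by(FLOWS, CONTEXT_BY_IP, "owner_tag")
```

Grouping by `"vpc"` and then by `"owner_tag"` mirrors the drill-down in the story: from region, to VPC, to the team that owns the workload, without leaving the flow data.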

With a last pivot to look at individual flow records, I can see two of the clustered front-end servers went offline at the same time that we saw an uptick in incoming requests. My network-focused brain tells me one of my developers has made a bad change in the scalability settings.
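The pivot described here boils down to two questions asked of the same flow records: which servers stopped receiving traffic, and did total requests go up or down? A rough sketch in Python follows, assuming each record has already been reduced to a time window and a destination server name. The server names, window numbers, and counts are invented for illustration.

```python
from collections import Counter

# Made-up (time_window, destination_server) pairs distilled from flow records.
RECORDS = [
    (0, "fe-1"), (0, "fe-2"), (0, "fe-3"), (0, "fe-4"),
    (1, "fe-1"), (1, "fe-2"), (1, "fe-3"), (1, "fe-4"),
    # Window 2: more requests overall, but fe-3 and fe-4 see none of them.
    (2, "fe-1"), (2, "fe-1"), (2, "fe-1"),
    (2, "fe-2"), (2, "fe-2"), (2, "fe-2"),
]


def summarize(records):
    """Count flows per server within each time window."""
    per_window = {}
    for window, server in records:
        per_window.setdefault(window, Counter())[server] += 1
    return per_window


def silent_servers(per_window, before, after):
    """Servers that received flows in `before` but none in `after`."""
    return sorted(set(per_window[before]) - set(per_window[after]))


def request_change(per_window, before, after):
    """Difference in total flow count between the two windows."""
    return sum(per_window[after].values()) - sum(per_window[before].values())


WINDOWS = summarize(RECORDS)
WENT_QUIET = silent_servers(WINDOWS, before=1, after=2)
DELTA = request_change(WINDOWS, before=1, after=2)
```

Servers going quiet while the request count climbs points away from the network path and toward the application tier, which is the conclusion the flow records support in this story.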

A minute later I’ve got a nice set of dashboards and flows so Dev Team Tampa has the info they need to resolve their application challenges, and I have proof that the network wasn’t actually slow.

Now it’s time for that third cup of coffee.