Wednesday, June 12, 2024

How we leveraged ChatOps to automate cloud introspection, to level up our infrastructure game!


Cloud Resource Explorer using simple ChatOps

Need to know your cloud dependencies in a pinch? Yes, we've been there. Here's how we leveraged ChatOps to make our lives easier.

Photo by Amy Elting on Unsplash

We've got cloud, and we've got 99 problems about what's living in our clouds. On most days you might have the luxury of time to unravel this dependency graph; however, if you're chasing down an incident, you need to know in a hurry! Here's how the Sentinels, our security team @ Hotstar, solved this using ChatOps.

A cloud Resource Explorer is one of the most important items in the toolkit for anyone building in a modern engineering team. While the reasons can vary, the need to know what resides where, and the metadata around it, should be met without much drama.

Here's what most people leverage today to find items in the cloud, and the challenges with each.

  1. Console: Doesn't scale in a multi-account setup; complex correlation is not possible.
  2. CLI or SDK (e.g. Boto): The CLI needs setup such as configuring keys, role-assume settings, etc. The SDK requires some programming comfort, so it doesn't scale for team members who aren't current with coding. A default problem that always exists with this method is managing the keys at scale and rotating them.
  3. Cloud Inventory or a Cloud Security Posture Management (CSPM) solution: The focus of this tool is security, not so much inventory. Therefore the data is stale and can only work as a coarse method, which might not serve all use cases.

While in combination these things might work, this isn't something that can be used in a pinch, and it would require stitching together a solution.

What we wanted to solve is something that shows up at scale on a day-to-day basis. For example, someone has a simple question; this someone could be a customer care executive or a backend developer. Their question might go something like:

“I want to know where is x.x.x.x IP in our Infra”

IP Details Slash command in Execution

Our goal was to make it as easy for people as querying an Excel sheet or a simple database. The traditional methods would fail for the simple reason that they require stitching and additional work each time this question is asked, unless you have pooled together some tooling. Add the complexity of multi-cloud, or even multiple access levels and so on, which is very common. Often, the headwind to even answer a simple question like this is intimidating.

We began to introspect the questions that our teams were asking. Here is a sampling:

Which account does this S3 bucket belong to & what kind of encryption is enabled on it?

For an access key, which account & user does this belong to?

I want to know what points to. Which account's R53 to check?

Each of these takes a different quantum of complexity to answer! Imagine spinning up bespoke scripts to handle each question; this is just not scalable.

We use Slack extensively for communicating. ChatOps can be on any chat app, for that matter. Anyone who is wondering about something on the infrastructure comes to Slack first and asks someone; most of the time, it's the DevOps, Infrastructure & Security teams who get these questions.

Our goal was simple: nothing should stop someone from asking a question, and answering it should involve minimal dependency on others.

Querying the cloud still remains the same: it's either the CLI, an SDK, or existing data from a source like a CSPM, which already pulls most of the data for you.

When to use real-time queries vs. CSPM data depends on the use case and how live you expect the data to be. For example, I expect IP data to be almost live (1–2 hr window), as a lot of IPs keep changing for various reasons: Spot nodes, Auto Scaling, etc. My IAM data can be 6–12 hours old, since user & access key creation is not that frequent. Similarly, pulled S3 or R53 data can also be around 6–12 hours old.
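The freshness trade-off above can be sketched as a small lookup. This is only an illustration: the `MAX_AGE` table, its keys, and the `choose_source` helper are assumptions made for this example, with the windows set to the upper bounds mentioned in the text.

```python
from datetime import datetime, timedelta, timezone

# Maximum acceptable age per data type. The keys and values are illustrative
# assumptions based on the windows discussed in the text.
MAX_AGE = {
    "ip": timedelta(hours=2),
    "iam": timedelta(hours=12),
    "s3": timedelta(hours=12),
    "r53": timedelta(hours=12),
}


def choose_source(data_type, last_synced, now=None):
    """Return "cache" if the stored copy is fresh enough, else "realtime"."""
    now = now or datetime.now(timezone.utc)
    return "cache" if now - last_synced <= MAX_AGE[data_type] else "realtime"
```

With a table like this, a stale IP record falls through to a real-time query, while IAM data of the same age is still served from the stored copy.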

Here is a simple architecture diagram to explain how it is built and used:

Simple Architecture Diagram for Cloud ChatOps

Components in the architecture:

Slack: This is where someone fires a Slash Command, depending upon the information they wish to get. This command can be fired from their DM or a dedicated channel; the response comes to a pre-defined channel.
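Slack delivers a slash command to the backend as a form-encoded POST body. As a rough sketch of what the receiving side might extract from it, the helper below pulls out the fields a router would typically need. The function name and the `/ip-details` command used in the usage note are hypothetical, not taken from the original setup.

```python
from urllib.parse import parse_qs


def parse_slash_command(body):
    """Extract the fields a router typically needs from a slash-command body."""
    # parse_qs returns a list per key; slash-command fields appear once each.
    fields = {key: values[0] for key, values in parse_qs(body).items()}
    return {
        "command": fields.get("command", ""),
        "text": fields.get("text", "").strip(),
        "user_id": fields.get("user_id", ""),
        "response_url": fields.get("response_url", ""),
    }
```

For a body such as `command=%2Fip-details&text=10.0.0.5&user_id=U123`, this yields the command name, the argument text, and the Slack user ID that later checks can use.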

API Router: This is where most of the logic sits. It authenticates the Slack user and the incoming payload, then routes the request to the corresponding API. The decision of whether to use the CSPM API, ES, or a real-time CLI query is taken here. The response to Slack is also sent by this component. It is a simple Flask app.
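The text does not say how the router authenticates the payload. One common approach is Slack's signing-secret scheme, in which Slack sends `X-Slack-Request-Timestamp` and `X-Slack-Signature` headers and the receiver recomputes an HMAC-SHA256 over `v0:<timestamp>:<raw body>`. The sketch below assumes that scheme; the function name and the five-minute replay window are choices made for this example.

```python
import hashlib
import hmac
import time


def is_valid_slack_request(signing_secret, timestamp, body, signature,
                           now=None, max_skew=300):
    """Check a request against Slack's v0 signing scheme."""
    now = time.time() if now is None else now
    try:
        # Reject old requests to limit replay attacks.
        if abs(now - int(timestamp)) > max_skew:
            return False
    except ValueError:
        return False
    base = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(
        signing_secret.encode(), base, hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

In a Flask view, the raw body and the two headers would be passed in before any routing decision is made, so unsigned requests never reach the query backends.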

CSPM API: This can be your CSPM, or any other cloud inventory service that pulls your posture data every 24 hrs. It would have some API exposed to query data out of it, which can be used.

Custom Full Text Search: You can use any full-text search engine here; we used Elasticsearch. We have a few cron jobs running to pull data and keep it as live as possible. The frequency of each cron depends on what kind of data is being pulled from the cloud. As mentioned before, IAM data can be pulled every 6–12 hrs, IP data every 1 hour, and so on. This frequency depends on your environment & the priority given to certain resources.

Real Time Queries: You can fire custom queries using the CLI, an SDK such as Boto, or a tool like Steampipe.
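As a minimal sketch of a real-time query with Boto, the helper below looks up a private IP through the EC2 `describe_network_interfaces` call. It assumes a single account and region, ignores pagination, and takes the client as a parameter; the function name and the shape of the returned records are assumptions for illustration only.

```python
def find_private_ip(ec2_client, ip):
    """Return network interfaces holding the given private IP (one account/region)."""
    response = ec2_client.describe_network_interfaces(
        Filters=[{"Name": "addresses.private-ip-address", "Values": [ip]}]
    )
    return [
        {
            "interface_id": eni["NetworkInterfaceId"],
            "vpc_id": eni.get("VpcId"),
            # Attachment is absent for interfaces not attached to anything.
            "instance_id": eni.get("Attachment", {}).get("InstanceId"),
            "description": eni.get("Description", ""),
        }
        for eni in response.get("NetworkInterfaces", [])
    ]
```

In practice it would be called with something like `boto3.client("ec2", region_name=...)`, once per account and region of interest, which is exactly the stitching work that a pre-built index avoids for common questions.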

Note: Access Control. Of course, not everyone is allowed to see everything; we want some restrictions on what kind of data can be queried by what class of people. Simple access controls can be written based on the Slack User ID of whoever fires the command. Groups of Slack User IDs can be allowed or denied the ability to fire certain APIs.
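A check like this can be as small as two lookup tables. The group names, user IDs, and command names below are hypothetical placeholders; a real deployment would load them from configuration rather than hard-code them.

```python
# Hypothetical groups of Slack user IDs.
GROUPS = {
    "security": {"U001", "U002"},
    "support": {"U100"},
}

# Hypothetical mapping of slash commands to the groups allowed to fire them.
ALLOWED_GROUPS = {
    "/ip-details": {"security", "support"},
    "/access-key-owner": {"security"},
}


def is_allowed(user_id, command):
    """Allow a command only if the user belongs to a group permitted to run it."""
    # Unknown commands map to no groups, so they are denied by default.
    permitted = ALLOWED_GROUPS.get(command, set())
    return any(user_id in GROUPS.get(group, set()) for group in permitted)
```

Denying unknown commands and unknown users by default keeps the failure mode safe when a new command is added before its access rule is.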



