What you'll do:
Omnichannel eCommerce production support: Provide L1 and L2 production support for multiple cloud technologies such as OpenStack, Cloud Native platform, Microsoft Azure, and Google Cloud Platform, triaging critical issues with internal and vendor tools.
Develop Tools and support: Design and develop solutions for widespread internal communications about cloud application support, and workflows for infrastructure availability issues across internal applications, using programming languages such as Java, JavaScript (React, Node.js), Python, and shell scripting, along with technologies such as Prometheus and database query languages. Design and develop a UI tool that displays Item Content Quality data on a dashboard using AngularJS, HTML5, and CSS3.
Work on Product Enrichment & Content Services projects at Walmart: Develop enterprise monitoring and use tooling such as Grafana, Kibana, Splunk, Graphite, and New Relic to improve visibility, proactively detect issues, and restore system availability.
Alert, Monitoring, Log analysis: Analyze monitoring graphs and alerts to identify the systems causing production impact, using tools such as Grafana, Prometheus, MMS, Kibana, Graphite, ServiceNow, JIRA, Dynatrace, New Relic, Omniture, Splunk, and CDN logs [Reduce MTTD – Mean Time to Detect].
Incident triage, Escalation and Resolution: Triage site-impacting production issues by quantifying impact, severity, and urgency, analyzing systems for quick remediation, engaging the right teams for recovery [Reduce MTTE – Mean Time to Engage], and focusing on immediate restoration [Reduce MTTR – Mean Time to Restore] of large-scale enterprise systems.
Enhance Alerting solutions: Design and implement JavaScript integrations between the alerting tool and the service API endpoints of tools such as ServiceNow, Spotlight, Splunk, and xMatters.
What you'll bring:
Automation and Self-healing: Demonstrate knowledge of scripting and software development for automation and self-healing of multi-cloud environments. Help enhance existing solutions by developing automation with Docker and Kubernetes and by working with DevOps and Engineering partners.
Excellent end-to-end technical understanding of core infrastructure, cloud services, platforms, and microservices.
Ability to triage effectively: detect issues and distinguish symptom from cause.
Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
Influence the design of system architecture and tactical solutions.
Familiarity with log-centric tooling, ideally Splunk. Ability to produce time series data and reusable dashboards for use both during and after an event.