Capital One Senior Lead Engineer - Generative AI Infrastructure Remote-Eligible 
United States, New York, New York 
560078470

26.06.2024
NYC 299 Park Avenue (22957), United States of America, New York, New York

Senior Lead Engineer - Generative AI Infrastructure (Remote-Eligible)

We are looking for an experienced Sr. Lead Engineer, Generative AI Infrastructure to help us build the foundations of our AI capabilities. You will work on a wide range of initiatives: building large-scale distributed training clusters, deploying LLMs on GPU instances for real-time applications and decisioning systems, and supporting cutting-edge AI research and development, all in our public cloud infrastructure. You will work closely with our cloud and container infrastructure teams as well as our world-class team of AI researchers to design and implement key capabilities. Examples of projects you will work on:

  • Deploy a thousand-node training cluster in our public cloud, with an optimized storage and networking stack and tightly coupled training pipelines that take advantage of multiple parallelism strategies.

  • Design and build fault-tolerant infrastructure that supports long-running, large-scale training tasks reliably despite the failure of individual nodes, using containers and checkpointing libraries.

  • Design and build run-time infrastructure for serving large ML models such as LLMs and FMs in our public cloud.

  • Build infrastructure for deploying search indexes and embeddings in vector databases that will work closely with the rest of our capabilities.

Basic Qualifications:

  • Bachelor's degree in Computer Science, Computer Engineering or a technical field

  • At least 8 years of experience designing and building data-intensive solutions using distributed computing

  • At least 8 years of experience programming with Python, Go, Scala, or Java

  • At least 1 year of experience with HPCs, vector embedding, or semantic search technologies

  • At least 1 year of experience building, scaling, and optimizing training or inferencing systems for deep neural networks

Preferred Qualifications:

  • Master's or Doctoral degree in Computer Science, Computer Engineering, Electrical Engineering, Mathematics, or a similar field.

  • Background in machine learning, with experience in large-scale training and deployment of deep neural nets and/or transformer architectures.

  • Experience with machine learning frameworks such as TensorFlow, PyTorch, Lightning, or MosaicML.

  • Ability to move fast in an environment that is at times ambiguous, with competing priorities and deadlines.

  • Experience at tech and product-driven companies/startups preferred.

  • Ability to iterate rapidly with researchers and engineers to improve a product experience while building the foundational capabilities.

  • Familiarity with deploying large neural network models in demanding production environments.

  • Experience building GPU clusters in the public cloud with tightly coupled storage and networking.

This role is also eligible to earn performance-based incentive compensation, which may include cash bonus(es) and/or long-term incentives (LTI). Incentives could be discretionary or non-discretionary depending on the plan.

Eligibility varies based on full or part-time status, exempt or non-exempt status, and management level.

If you have visited our website in search of information on employment opportunities or to apply for a position, and you require an accommodation, please contact Capital One Recruiting at 1-800-304-9102 or via email. All information you provide will be kept confidential and will be used only to the extent required to provide needed reasonable accommodations.