Key Responsibilities
Software Testing & QA Leadership
- Design, review, and lead the implementation of test strategies, test plans, and test cases for software components across APIs, services, and UI.
- Oversee test automation script development using tools such as PyTest, Selenium, Playwright, or Postman (a minimal PyTest sketch follows this list).
- Maintain and optimize test automation pipelines, integrating with CI/CD tools (e.g., Jenkins, GitLab CI, Azure DevOps).
- Lead functional, regression, smoke, and performance testing efforts to validate system readiness.
- Ensure traceability from requirements to test cases and bug reports.
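
To make the automation expectation concrete, below is a minimal PyTest sketch of the kind of API check this role oversees. It is an illustrative sketch only: the base URL, endpoint paths, and expected response fields are hypothetical placeholders, not a real service.

```python
# Minimal PyTest sketch of an automated API check.
# BASE_URL, the endpoints, and the expected fields are hypothetical.
import pytest
import requests

BASE_URL = "https://api.example.com"  # hypothetical service under test

@pytest.fixture
def session():
    # Shared HTTP session so all tests send consistent headers.
    with requests.Session() as s:
        s.headers.update({"Accept": "application/json"})
        yield s

def test_health_endpoint_returns_ok(session):
    resp = session.get(f"{BASE_URL}/health", timeout=5)
    assert resp.status_code == 200

def test_user_lookup_returns_expected_fields(session):
    resp = session.get(f"{BASE_URL}/users/42", timeout=5)
    assert resp.status_code == 200
    body = resp.json()
    # Traceability note: each assertion should map back to a requirement ID.
    assert {"id", "name", "email"} <= body.keys()
```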
LLM Evaluation & Benchmarking
- Lead a team responsible for the evaluation of Large Language Model (LLM) outputs.
- Design capability-based evaluation benchmarks (e.g., summarization, reasoning, math, code generation).
- Guide the development and execution of auto-evaluation scripts using LLM-as-a-judge, rule-based, and metric-based methods (an illustrative sketch follows this list).
- Build and maintain evaluation pipelines that track model accuracy, hallucination rate, robustness, and related quality metrics over time.
- Collaborate closely with AI Engineers and Data Scientists to align evaluations with development priorities.
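
As a concrete illustration of the auto-evaluation work described above, here is a small sketch that combines a rule-based check with an LLM-as-a-judge score. Every name in it (the `judge` callable, the prompt template, the 0-1 scale) is a hypothetical stand-in for whatever tooling the team actually uses, not a prescribed implementation.

```python
# Illustrative evaluation loop: one rule-based check plus an
# LLM-as-a-judge score. All names here are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    reference: str
    model_output: str

def rule_based_pass(case: EvalCase) -> bool:
    # Example rule: output is non-empty and contains the reference answer.
    out = case.model_output.strip().lower()
    return bool(out) and case.reference.lower() in out

JUDGE_TEMPLATE = (
    "Rate the answer from 0 to 1.\n"
    "Question: {prompt}\nReference: {reference}\nAnswer: {answer}"
)

def evaluate(cases: list[EvalCase], judge: Callable[[str], float]) -> dict:
    rule_passes, judge_scores = [], []
    for case in cases:
        rule_passes.append(rule_based_pass(case))
        judge_scores.append(judge(JUDGE_TEMPLATE.format(
            prompt=case.prompt,
            reference=case.reference,
            answer=case.model_output,
        )))
    n = len(cases)
    return {
        "rule_pass_rate": sum(rule_passes) / n,
        "mean_judge_score": sum(judge_scores) / n,
    }

if __name__ == "__main__":
    # Stand-in judge for demonstration; a real one would call an LLM
    # and parse a numeric score out of its reply.
    demo_judge = lambda prompt: 1.0
    cases = [EvalCase("What is 2+2?", "4", "The answer is 4.")]
    print(evaluate(cases, demo_judge))
```

In practice a real judge model would be validated against human ratings before its scores are trusted in quality reports.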
Team Leadership & Technical Coaching
- Mentor and support a team of QA engineers and model evaluators.
- Allocate tasks, define sprint goals, and ensure timely and high-quality delivery of testing and evaluation artifacts.
- Foster a culture of test-first thinking, technical quality, and continuous improvement.
- Communicate evaluation insights and quality reports to product managers and stakeholders.
Required Qualifications
- Bachelor's or Master's degree in Computer Science, Software Engineering, AI, or a related field.
- 5+ years in software testing, including experience as a Senior QA Engineer or Test Lead.
- Strong experience in test case writing, test scenario design, and test automation scripting.
- Proficiency in a language such as Python, JavaScript, or Java for test automation scripting.
- Experience with tools such as PyTest, Selenium, JUnit, Playwright, Postman, etc.
- Familiarity with LLMs (e.g., DeepSeek, Mistral, LLaMA) and AI evaluation metrics such as BLEU, ROUGE, and accuracy (a worked metric example follows this list).
- Experience in building or maintaining benchmark datasets for AI evaluation.
- Understanding of prompt engineering, response validation, and error case analysis.
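
For the metrics named above, this is how BLEU and ROUGE are typically computed in Python, assuming the nltk and rouge-score packages are installed; the sentences are toy inputs for illustration.

```python
# Computing BLEU and ROUGE for one (reference, candidate) pair,
# assuming the `nltk` and `rouge-score` packages are installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# Sentence-level BLEU with smoothing (short sentences score 0 without it).
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F: {rouge['rougeL'].fmeasure:.3f}")
```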
Preferred Skills
- Experience with LLM evaluation libraries/tools like OpenAI Evals, TruLens, LangChain Eval, or custom scripts.
- Experience working with MLOps or AI pipelines and integrating automated tests into them.
- Familiarity with dataset labeling platforms or human-in-the-loop evaluation systems.
- Strong data analysis and reporting skills using Excel, Python (Pandas/Matplotlib), or BI dashboards (see the brief Pandas sketch after this list).
- Ability to define and customize evaluation logic per customer or business domain.
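
As a small illustration of the reporting skills listed above, here is a Pandas sketch that rolls per-case evaluation results up into a summary table; the column names and values are illustrative only, not a fixed schema.

```python
# Turning per-case evaluation results into a per-capability summary.
# Column names and values are illustrative placeholders.
import pandas as pd

results = pd.DataFrame(
    {
        "capability": ["summarization", "summarization", "reasoning", "code"],
        "judge_score": [0.9, 0.7, 0.4, 0.8],
        "hallucinated": [False, True, False, False],
    }
)

report = results.groupby("capability").agg(
    mean_judge_score=("judge_score", "mean"),
    hallucination_rate=("hallucinated", "mean"),  # mean of bools = rate
    n_cases=("judge_score", "size"),
)
print(report)
```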