AI Evaluation Engineer

Hudson Manpower
30.00 - 50.00 USD / Hour
  • IT
  • Full-Time
  • Applications have closed

++Overview++

We are looking for an AI Evaluation Engineer with deep expertise in LLM benchmarking and evaluation frameworks. The candidate will design, automate, and execute structured evaluations that assess model quality, safety, performance, and cost. This role plays a critical part in delivering reliable, scalable, and enterprise-ready Generative AI solutions in a fully remote environment.

  • ++Location:++ Bellevue, WA (Remote)
  • ++Duration:++ 6+ Months
  • ++Work Authorization:++ USC, GC, GC EAD, H4 EAD, TN
  • ++Interview Mode:++ Video Interview

++Job Summary++

We are seeking an experienced AI Evaluation Engineer to design, automate, and execute large language model (LLM) evaluation and benchmarking frameworks for Generative AI systems. This role focuses on assessing model quality, safety, performance, latency, and cost across Azure OpenAI and other GenAI platforms. The ideal candidate has strong hands-on experience with evaluation metrics, prompt testing, and Python-based automation, and can ensure reliable, enterprise-grade AI outputs.

++Key Responsibilities++

  • Design and execute structured LLM evaluation (Eval) test suites to measure accuracy, relevance, safety, latency, and cost

  • Perform hands-on benchmarking and comparative analysis of Generative AI models

  • Build and maintain automated evaluation pipelines using Python

  • Create and manage datasets, benchmarks, and ground-truth references

  • Conduct structured prompt testing using Azure OpenAI and OpenAI APIs

  • Analyze hallucinations, bias, safety, and security risks in LLM outputs

  • Establish baselines and compare multiple models and prompt strategies

  • Ensure reproducibility and consistency of evaluation results

  • Document evaluation methodologies, metrics, and findings

  • Collaborate with AI/ML engineers, product teams, and stakeholders

++Key Skills++

  • AI Evaluation

  • LLM Benchmarking

  • Azure OpenAI

  • OpenAI Evals

  • Prompt Engineering

  • Prompt Testing

  • Evaluation Metrics

  • Hallucination Analysis

  • Python Automation

  • Generative AI Testing
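
The responsibilities above — structured Eval test suites that measure accuracy, latency, and cost — could be sketched as a minimal Python harness. This is an illustrative sketch, not part of the posting: the exact-match scoring, the flat per-call cost, and the stubbed model function stand in for a real Azure OpenAI client and production metrics.

```python
import time
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One structured test case: a prompt plus its ground-truth answer."""
    prompt: str
    expected: str

@dataclass
class EvalResult:
    accuracy: float
    avg_latency_s: float
    total_cost_usd: float

def run_eval_suite(model_fn, cases, cost_per_call_usd=0.001):
    """Run every case through `model_fn`, scoring exact-match accuracy and
    recording wall-clock latency plus a flat per-call cost estimate."""
    correct = 0
    latencies = []
    for case in cases:
        start = time.perf_counter()
        output = model_fn(case.prompt)
        latencies.append(time.perf_counter() - start)
        if output.strip().lower() == case.expected.strip().lower():
            correct += 1
    return EvalResult(
        accuracy=correct / len(cases),
        avg_latency_s=sum(latencies) / len(latencies),
        total_cost_usd=cost_per_call_usd * len(cases),
    )

# Usage with a stubbed model in place of a real Azure OpenAI call:
cases = [EvalCase("2+2?", "4"), EvalCase("Capital of France?", "Paris")]
result = run_eval_suite(lambda p: "4" if "2+2" in p else "Paris", cases)
```

In practice the exact-match check is replaced by semantic similarity or an LLM-as-judge scorer, and cost is computed from token usage rather than a flat rate.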

++Must-Have Hands-On Experience (Critical)++

  • LLM evaluation and benchmarking for Generative AI models

  • Designing and executing Eval test suites

  • Automated evaluation pipeline development using Python

  • Working with Azure OpenAI and structured prompt testing

  • Creating datasets, benchmarks, and ground-truth references
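
Creating datasets and ground-truth references, the last must-have above, often amounts to maintaining versioned JSON Lines files that evaluation pipelines read back. A minimal sketch; the field names (`id`, `prompt`, `reference`) are assumptions, not a format mandated by the posting.

```python
import json
import os
import tempfile

def write_benchmark(path, records):
    """Write ground-truth records as JSON Lines, one eval case per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def load_benchmark(path):
    """Read the benchmark back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage: round-trip two illustrative ground-truth cases.
records = [
    {"id": "geo-001", "prompt": "Capital of France?", "reference": "Paris"},
    {"id": "math-001", "prompt": "2+2?", "reference": "4"},
]
path = os.path.join(tempfile.gettempdir(), "benchmark.jsonl")
write_benchmark(path, records)
loaded = load_benchmark(path)
```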

++Required Skills++

++Technical Skills++

  • Azure OpenAI / OpenAI APIs

  • LLM evaluation and benchmarking frameworks

  • Evaluation metrics: Precision, Recall, F1, BLEU, ROUGE, hallucination rate, latency, cost

  • Prompt engineering: zero-shot, few-shot, and system prompts

  • Python for automation, batch evaluation execution, and data analysis

  • Evaluation tools and frameworks:

      • OpenAI Evals

      • HuggingFace Evals

      • Promptfoo

      • RAGAS

      • DeepEval

      • LM Evaluation Harness

  • AI safety evaluation, bias testing, and security assessment
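
Two of the metrics listed above can be sketched in plain Python: token-overlap precision/recall/F1 (the SQuAD-style variant commonly used for LLM answers) and a naive hallucination-rate proxy. Both are simplified illustrations; production pipelines typically use NLI models or LLM-as-judge scoring for hallucination detection.

```python
def token_prf(prediction, reference):
    """Token-overlap precision, recall, and F1 against a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def hallucination_rate(answers, contexts):
    """Naive proxy: fraction of answers containing any token that does not
    appear in the supplied context."""
    flagged = 0
    for ans, ctx in zip(answers, contexts):
        ctx_tokens = set(ctx.lower().split())
        if any(t not in ctx_tokens for t in ans.lower().split()):
            flagged += 1
    return flagged / len(answers)

# Usage: a verbose answer scores high recall but low precision.
p, r, f = token_prf("Paris is the capital", "Paris")
```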

++Functional Skills++

  • Test design and test automation

  • Reproducible evaluation pipeline design

  • Model comparison and baseline creation

  • Strong analytical and problem-solving skills

  • Clear technical documentation and reporting

  • Cross-functional collaboration with AI/ML and product teams