Senior Data Engineer
Job Title: Senior Data Engineer
Location (onsite): Vienna, VA
Pay Rate: $43-$53 per hour (W2)
Duration: 6 months
Key Responsibilities:
- Build and Maintain Data Pipelines: Develop scalable data pipelines using PySpark and Spark within the Databricks environment.
- Implement Medallion Architecture: Design workflows using raw, trusted, and refined layers to drive reliable data processing.
- Integrate Diverse Data Sources: Ingest data from Kafka streams, extract channels, and APIs.
- Data Cataloging and Governance: Model and register datasets in enterprise data catalogs, ensuring robust governance and accessibility.
- Access Control: Manage secure, role-based access patterns to support analytics, AI, and ML needs.
- Team Collaboration: Work closely with peers to achieve required code coverage and deliver high-quality, well-tested solutions.
- Optimize and Operationalize: Tune Spark jobs (partitioning, caching, broadcast joins, AQE), manage Delta Lake performance (Z-Ordering, OPTIMIZE, VACUUM), and implement cost and reliability best practices on AWS.
- Data Quality and Testing: Implement data quality checks and validations (e.g., Great Expectations, custom PySpark checks), unit/integration tests, and CI/CD for Databricks Jobs/Workflows.
- Infrastructure as Code: Provision and manage Databricks and AWS resources using Terraform (workspaces, clusters, jobs, secret scopes, Unity Catalog objects, S3, IAM).
- Monitoring and Observability: Set up logging, metrics, and alerts (CloudWatch, Datadog, Databricks audit logs) for pipelines and jobs.
- Documentation: Produce clear technical documentation, runbooks, and data lineage for governed datasets.
Required Skills & Qualifications:
- Databricks: 6-9 years of experience with expert-level proficiency
- PySpark/Spark: 6-9 years of advanced hands-on experience
- AWS: 6-9 years of experience, with strong competency in S3 and in Terraform for infrastructure as code
- Data Architecture: Solid knowledge of the medallion pattern and data warehousing best practices
- Data Pipelines: Proven ability to build, optimize, and govern enterprise data pipelines
- Delta Lake and Unity Catalog: Expertise in Delta Lake internals, time travel, schema evolution/enforcement, and Unity Catalog RBAC/ABAC
- Streaming: Hands-on experience with Spark Structured Streaming, Kafka, checkpointing, exactly-once semantics, and late-arriving data handling
- CI/CD: Experience with Git-based workflows and CI/CD for Databricks (e.g., Databricks Repos, dbx, GitHub Actions, Azure DevOps, or Jenkins)
- Security and Compliance: Experience with IAM, KMS, encryption, secrets management, token/credential rotation, and PII governance
- Performance and Cost: Demonstrated ability to tune Spark jobs and optimize Databricks cluster configurations and AWS usage for cost and throughput
- Collaboration: Experience working in Agile/Scrum teams, peer reviews, and achieving code coverage targets
Preferred Skills & Qualifications:
- Certifications: Databricks Data Engineer Professional, AWS Solutions Architect/Developer, HashiCorp Terraform Associate
- Data Catalogs: Experience with enterprise catalogs such as Collibra or Alation, and lineage tooling such as OpenLineage
- Orchestration: Databricks Workflows and/or Airflow
- Additional AWS: Glue, Lambda, Step Functions, CloudWatch, Secrets Manager
- Testing: pytest, chispa, Great Expectations, dbx test
- Domain Experience: Analytics and ML feature pipelines, MLOps integrations