Senior Data Engineer

6 days ago

Photon

US · Full-time · $40,000 – $140,000

About this role

As a Senior Data Engineer, you will be responsible for breaking down data silos. This role focuses on building a unified, high-performance data layer using Data Federation techniques. You will architect a Data Lakehouse environment in which disparate sources feel like a single, cohesive database to analytics and AI teams.

Design and implement federated query layers using Starburst/Trino for high-speed analytics across distributed data sources without unnecessary data movement. Build scalable, distributed ETL/ELT pipelines with Python and Apache Spark (PySpark). Manage modern table formats like Delta Lake, Apache Iceberg, or Hudi for ACID transactions in the data lake.

Optimize Spark jobs and SQL queries across the federation layer to minimize latency and manage compute costs. Implement fine-grained access control and data masking within the federation engine. Ensure data privacy across all connected platforms for analytics and AI teams.

Work on cloud platforms such as AWS (EMR, S3, Glue), Azure (Databricks, ADLS), or GCP. Apply expert SQL skills and data modeling proficiency in Star/Snowflake schemas and Medallion Architecture. Experience with IaC, dbt, Kubernetes, Data Mesh, or Data Fabric is a bonus.

Requirements

5+ years of experience with Python and deep expertise in Apache Spark tuning (partitioning, shuffling, caching)
Hands-on experience with Starburst Enterprise, Trino (Presto), or Dremio
Proven track record working with Delta Lake or Iceberg architectures
Extensive experience with AWS (EMR, S3, Glue), Azure (Databricks, ADLS), or GCP
Expert-level SQL skills for complex analytical queries and query plan analysis
Proficiency in designing Star/Snowflake schemas and understanding Medallion Architecture (Bronze, Silver, Gold layers)

Responsibilities

Design and implement federated query layers (e.g., Starburst/Trino) to allow high-speed analytics across distributed data sources without unnecessary data movement
Build scalable, distributed data processing pipelines using Python and Apache Spark (PySpark)
Manage and optimize modern table formats like Delta Lake, Apache Iceberg, or Hudi to bring ACID transactions to the data lake
Optimize Spark jobs and SQL queries across the federation layer to minimize latency and manage compute costs
Implement fine-grained access control and data masking within the federation engine to ensure data privacy across all connected platforms