Solution Architect – Databricks (Document AI & Knowledge Graph Focus)
Location: India – Pune
Company Overview
We are a global empathy-led technology services company where software and people transformations go hand-in-hand.
Product innovation and mature software engineering are part of our core DNA. Our mission is to help our customers accelerate their digital journeys through a global, diverse, and empathetic talent pool that follows outcome-driven agile execution. Respect, Fairness, Growth, Agility, and Inclusiveness are the core values that we aspire to live by each day.
We continue to invest in the digital strategy, design, cloud engineering, data, and enterprise AI capabilities required to bring a truly integrated approach to solving our clients' most ambitious digital journey challenges.
About the Role
We are seeking a highly skilled Solution Architect with deep expertise in Databricks Lakehouse and proven experience operationalizing unstructured document AI pipelines in regulated industries. You will design and lead end-to-end architectures that turn complex, high-volume documents into governed, queryable knowledge graphs inside Databricks — enabling automation, compliance, and downstream AI applications.
This role bridges Lakehouse architecture, AI/LLM-based extraction, and human-in-the-loop governance to deliver production-ready solutions that Databricks customers can trust.
Key Responsibilities
- Architecture & Design – Lead the design and implementation of in-Lakehouse pipelines for unstructured and structured data, leveraging Delta Live Tables, Unity Catalog, and MLflow (first sketch after this list).
- Unstructured Data Processing – Architect solutions for ingestion, OCR, and LLM-based parsing of scanned PDFs, legal/medical records, and complex forms.
- Confidence Scoring & HITL Workflows – Design confidence-tiered pipelines that auto-accept high-confidence results and route low-confidence extractions to review consoles, ensuring auditability and compliance (second sketch after this list).
- Knowledge Graph Modeling – Translate extracted entities and relationships into graph-friendly Delta Gold tables, ready for analytics or export to graph databases (third sketch after this list).
- Governance & Compliance – Define security, lineage, and classification rules in Unity Catalog; ensure all document transformations are fully traceable back to the source (fourth sketch after this list).
- Integration & Ecosystem – Use Partner Connect and APIs to integrate Databricks outputs with downstream claims/case management, compliance dashboards, and BI tools.
- Performance & Cost Optimization – Tune pipelines for scalability, performance, and cost efficiency in cloud environments (AWS, Azure, GCP).
- Collaboration & Mentorship – Work with data engineers, ML engineers, and domain SMEs to translate business requirements into scalable architectures, mentoring teams on Databricks and document AI best practices.
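
To give candidates a concrete feel for the first responsibility, here is a minimal sketch using the Delta Live Tables Python API. It assumes a hypothetical JSON landing path and illustrative table and column names; it is an orientation aid, not a prescribed design.

```python
# Minimal Delta Live Tables sketch. Runs inside a DLT pipeline, where the
# `dlt` module and `spark` session are provided by the runtime. The landing
# path, table names, and columns are illustrative assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw document records as ingested via Auto Loader.")
def documents_bronze():
    return (
        spark.readStream.format("cloudFiles")          # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/docs/landing/")           # hypothetical path
    )

@dlt.table(comment="Silver: validated documents with lineage-friendly columns.")
@dlt.expect_or_drop("has_doc_id", "doc_id IS NOT NULL")  # hard quality gate
@dlt.expect("has_text", "ocr_text IS NOT NULL")          # tracked, not enforced
def documents_silver():
    return (
        dlt.read_stream("documents_bronze")
        .withColumn("ingested_at", F.current_timestamp())
    )
```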
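
The confidence-tiered routing pattern can be sketched as a threshold split over extraction results. The threshold, column names, and table names below are assumptions for illustration.

```python
# Confidence-tiered routing sketch: auto-accept high-confidence extractions,
# queue the rest for human review. Threshold and table names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

AUTO_ACCEPT = 0.95  # illustrative threshold; tune per field and per risk tier

extractions = spark.read.table("main.docai.extractions_silver")  # hypothetical

accepted = extractions.filter(F.col("confidence") >= AUTO_ACCEPT)
needs_review = extractions.filter(F.col("confidence") < AUTO_ACCEPT)

# Accepted rows flow straight to gold; low-confidence rows land in a review
# queue table that a human-in-the-loop console reads from and writes back to.
accepted.write.mode("append").saveAsTable("main.docai.extractions_gold")
(needs_review
    .withColumn("review_status", F.lit("pending"))
    .write.mode("append")
    .saveAsTable("main.docai.review_queue"))
```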
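
Graph-friendly Gold modeling commonly comes down to two Delta tables, nodes and edges, that analytics tools or an external graph database can consume directly. The schemas below are illustrative assumptions.

```python
# Relational-to-graph sketch: project extracted entities and relations into
# node and edge Delta Gold tables. All schemas here are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

entities = spark.read.table("main.docai.entities_silver")    # hypothetical
relations = spark.read.table("main.docai.relations_silver")  # hypothetical

# Nodes: one row per unique entity, keyed for graph import.
nodes = (entities
    .select(
        F.col("entity_id").alias("node_id"),
        F.col("entity_type").alias("label"),
        F.col("canonical_name").alias("name"),
        F.col("source_doc_id"),  # keeps lineage back to the source document
    )
    .dropDuplicates(["node_id"]))

# Edges: subject -> object with a typed relationship.
edges = relations.select(
    F.col("subject_entity_id").alias("src"),
    F.col("object_entity_id").alias("dst"),
    F.col("relation_type").alias("rel"),
    F.col("confidence"),
)

nodes.write.mode("overwrite").saveAsTable("main.docai.kg_nodes_gold")
edges.write.mode("overwrite").saveAsTable("main.docai.kg_edges_gold")
```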
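
For the governance responsibility, Unity Catalog tags are one way classification rules become queryable and enforceable. The catalog, table, column, and tag names in this sketch are assumptions.

```python
# Unity Catalog governance sketch: attach classification tags at table and
# column level so policies and audits can key off them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table-level classification tags.
spark.sql("""
    ALTER TABLE main.docai.extractions_gold
    SET TAGS ('data_classification' = 'confidential', 'domain' = 'claims')
""")

# Column-level PII tag on a sensitive field.
spark.sql("""
    ALTER TABLE main.docai.extractions_gold
    ALTER COLUMN claimant_name SET TAGS ('pii' = 'true')
""")
```

Pairing such tags with a source_doc_id column carried through every layer, as in the nodes sketch above, is what keeps each gold value traceable back to the originating file.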
Required Skills & Experience
- 5+ years in solution or data architecture, including 2+ years delivering Databricks-based solutions.
- Proven hands-on experience with Unity Catalog, Delta Live Tables, Databricks SQL, and Partner Connect integrations.
- Expertise in designing Lakehouse architectures for structured and unstructured data.
- Strong understanding of OCR integration patterns (AWS Textract, Azure AI Document Intelligence (formerly Form Recognizer), Tesseract) and LLM-powered entity extraction (prompt design, schema mapping, validation; see the sketch after this list).
- Experience with confidence scoring and human-in-the-loop patterns for data quality and compliance.
- Familiarity with knowledge graph concepts and relational-to-graph data modeling in Databricks.
- Strong SQL skills and experience with distributed data processing (PySpark/SparkSQL).
- Working knowledge of cloud data ecosystems (AWS, Azure, or GCP).
- Excellent communication skills with the ability to bridge technical and business teams.
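
As a rough illustration of the prompt-design, schema-mapping, and validation pattern listed above: the prompt template, field schema, and helper below are invented for illustration, and the model call itself is deliberately omitted.

```python
# Hypothetical sketch of LLM entity extraction with schema validation.
import json

EXTRACTION_PROMPT = (
    "Extract the following fields from the document text and return ONLY "
    "valid JSON: claimant_name (string), claim_date (YYYY-MM-DD), "
    "claim_amount (number).\n\nText:\n{text}"
)

# Target schema the model's JSON must map onto.
REQUIRED_FIELDS = {
    "claimant_name": str,
    "claim_date": str,
    "claim_amount": (int, float),
}

def parse_and_validate(raw_response: str):
    """Return the parsed record, or None so the caller can route it to review."""
    try:
        record = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # malformed output goes to review, never silently dropped
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record or not isinstance(record[field], expected):
            return None
    return record
```

A None result here is exactly the signal the confidence-tiered routing sketch under Key Responsibilities would treat as a review-queue row.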
Preferred Qualifications
- Databricks certifications (e.g., Databricks Certified Solutions Architect, Data Engineer Professional).
- Experience delivering document AI pipelines in regulated verticals (legal, insurance, healthcare).
- Familiarity with data mesh or federated governance models.
- Background in MLOps and continuous improvement for extraction models.