
Mani Khullar

Development
NJ, United States

Skills

Data Engineering

About

Mani Khullar's skills align with System Developers and Analysts (Information and Communication Technology), and he also has skills associated with Database Specialists (Information and Communication Technology). He has 9 years of work experience.

Work Experience

Sr. Data Engineer

Warner Bros Discovery
August 2021 - Present
  Responsibilities:
  • Designed, implemented, and maintained data integration and ETL (Extract, Transform, Load) pipelines to move and transform data between various systems and databases.
  • Ensured data quality, consistency, and reliability by performing data validation and error handling within integration workflows.
  • Created and managed workflows using Apache Airflow, scheduling and orchestrating data integration tasks, ensuring timely execution, and monitoring job statuses. Customized and optimized Airflow DAGs (Directed Acyclic Graphs) to meet specific data pipeline requirements.
  • Utilized the Snowflake data warehousing platform for data storage and processing, including creating and managing databases, schemas, and tables. Worked on Snowflake stages for data ingestion and developed efficient data loading strategies.
  • Developed and optimized complex SQL queries to extract, transform, and analyze data from databases. Implemented performance tuning techniques to improve query execution times and optimize database performance.
  • Wrote Python scripts and code to automate data extraction, transformation, and loading processes. Developed custom Python functions and modules to perform data manipulations and transformations.
  • Managed data transfers between systems via SFTP and API integrations.
  • Dealt with various data file formats, including Parquet, CSV, SAV, and GZ, ensuring compatibility and efficient processing.
  • Conducted data type conversions and transformations to prepare data for analysis and reporting, and ensured compatibility between data types used in source and target systems.
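
  The following is a minimal, illustrative sketch of the Airflow-orchestrated Snowflake load pattern described above. The DAG id, schedule, connection id, stage, and table names are hypothetical placeholders, not the actual Warner Bros Discovery pipelines.

  ```python
  # Hypothetical Airflow DAG sketching the "stage to Snowflake, then validate" pattern.
  # All identifiers (DAG id, connection id, stage, table) are placeholder assumptions.
  from datetime import datetime, timedelta

  from airflow import DAG
  from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

  default_args = {
      "owner": "data-engineering",
      "retries": 2,
      "retry_delay": timedelta(minutes=10),
  }

  with DAG(
      dag_id="daily_events_load",              # placeholder DAG id
      schedule_interval="0 6 * * *",           # once a day at 06:00 UTC
      start_date=datetime(2023, 1, 1),
      catchup=False,
      default_args=default_args,
  ) as dag:

      # COPY staged Parquet files from an external stage into a Snowflake table.
      load_stage = SnowflakeOperator(
          task_id="copy_into_raw",
          snowflake_conn_id="snowflake_default",
          sql="""
              COPY INTO raw.events
              FROM @raw.s3_events_stage
              FILE_FORMAT = (TYPE = PARQUET)
              MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
          """,
      )

      # Basic data-quality gate: fail the run if nothing landed for the execution date.
      validate = SnowflakeOperator(
          task_id="validate_row_count",
          snowflake_conn_id="snowflake_default",
          sql="""
              SELECT 1 / COUNT(*)   -- division by zero fails the task when the load is empty
              FROM raw.events
              WHERE load_date = '{{ ds }}';
          """,
      )

      load_stage >> validate
  ```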

Samach Innovations LLC
August 2015 - Present
  • Location: Remote

Sr. Data Engineer

Continental Resources
April 2020 - August 2021
  Responsibilities:
  • Implemented solutions for ingesting data from various sources and processing data at rest utilizing Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive with cloud architecture.
  • Experience in Python, Django, REST APIs, and AWS.
  • Constructed and maintained an appropriate, scalable, and easy-to-use infrastructure with various tools to support the development of actionable reports used in decision-making across the strategy team.
  • Worked on AWS, implementing solutions using services such as EC2, S3, RDS, VPC, and Lambda.
  • Developed Spark code using Python and Spark SQL for faster testing and data processing.
  • Performed data profiling and transformation on the raw data using Pig, Python, and Java.
  • Used Apache Spark for batch processing to source the data.
  • Expert in writing business analytical scripts using Hive SQL.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Analyzed, designed, developed, implemented, and maintained parallel jobs using IBM InfoSphere DataStage.
  • Involved in the design of the dimensional data model: Star schema and Snowflake schema.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data.
  • Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
  • Generated DB scripts from the data modeling tool and created physical tables in the database.
  • Used the DataStage Director to schedule and run ETL jobs, test and debug their components, and monitor performance statistics.
  • Experienced in PX file stages, including Complex Flat File stage, Dataset stage, Lookup File stage, and Sequential File stage.
  • Created routines (Before/After, transform functions) used across the project.
  • Experienced in developing parallel jobs using various development/debug stages (Peek stage, Head & Tail stage, Row Generator stage, Column Generator stage, Sample stage) and processing stages (Aggregator, Change Capture, Change Apply, Filter, Sort & Merge, Funnel, Remove Duplicates stage).
  • Repartitioned job flow by determining the best available DataStage PX resource consumption.
  • Successfully implemented pipeline and partitioning parallelism techniques and ensured load balancing of data.
  • Involved in creating UNIX shell scripts for database connectivity and executing queries in parallel job execution.
  • Documented all changes implemented across all systems and components using Confluence and Atlassian Jira. Documentation includes technical changes, infrastructure changes, and business process changes; post-release documentation also includes known issues from production implementation and deferred defects.
  • For the last 12 months, worked on Snowflake to implement UDFs and read data from S3 into Snowflake to generate datasets.
  Environment: DataStage, Netezza, E3 Framework, Unix scripting, Hadoop 3.0, HBase 1.2, Hive 2.3, AWS, EC2, S3, RDS, VPC, MySQL, Redshift, Sqoop, HDFS, Spark, ETL, Python, UDF, NoSQL.
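
  A short PySpark sketch of the "pull from the HDFS data lake, reshape, and publish to Hive" work described above. The paths, column names, and Hive table are hypothetical placeholders, not the Continental Resources code.

  ```python
  # Illustrative PySpark batch job: read raw data from HDFS, conform it, and publish to Hive.
  # Paths, columns, and the target table are placeholder assumptions.
  from pyspark.sql import SparkSession, functions as F

  spark = (
      SparkSession.builder
      .appName("daily-production-batch")     # placeholder application name
      .enableHiveSupport()
      .getOrCreate()
  )

  # Read semi-structured records landed on HDFS.
  raw = spark.read.json("hdfs:///data/lake/raw/production/2021-07-*")

  # Clean and conform the records (typical Spark SQL / DataFrame transformations).
  conformed = (
      raw.filter(F.col("well_id").isNotNull())
         .withColumn("produced_bbl", F.col("produced_bbl").cast("double"))
         .withColumn("prod_date", F.to_date("prod_date", "yyyy-MM-dd"))
         .dropDuplicates(["well_id", "prod_date"])
  )

  # Aggregate with Spark SQL and write a partitioned Hive table for analysts.
  conformed.createOrReplaceTempView("production_staging")
  daily = spark.sql("""
      SELECT well_id, prod_date, SUM(produced_bbl) AS total_bbl
      FROM production_staging
      GROUP BY well_id, prod_date
  """)

  daily.write.mode("overwrite").partitionBy("prod_date").saveAsTable("analytics.daily_production")
  ```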

Data Engineer

Panasonic (Remote)
January 2019 - March 2020
  Responsibilities:
  • Designing the pipeline: determined the data sources, transformations, and destination for the pipeline, including identifying the required AWS Glue components such as crawlers, jobs, and triggers.
  • Setting up data sources: configured AWS Glue to connect to data sources, including various databases, data lakes, and streaming services, and configured the necessary access permissions and connectivity options. Worked with Pandas and Lambda.
  • Defining data transformations: created and configured AWS Glue jobs to perform the required transformations on the data, including data cleaning, data enrichment, and data aggregation tasks, depending on the specific pipeline requirements.
  • Building ETL workflows: used AWS Glue to define the order and dependencies of data transformations, creating ETL workflows or DAGs (Directed Acyclic Graphs) to ensure data is processed in the correct sequence.
  • Configuring scheduling and triggers: set up scheduling options for the pipeline, determining how often data should be processed and transformed, and configured triggers to automate pipeline execution based on specific events or conditions.
  • Monitoring and troubleshooting: regularly monitored pipeline execution and performance using AWS Glue monitoring tools, and troubleshot issues such as data quality problems, job failures, and connectivity errors.
  • Managing security and access: ensured proper security measures were in place for the pipeline, including data encryption, access controls, and compliance with relevant policies and regulations.
  • Scaling and optimization: monitored pipeline performance and optimized it as needed to handle increasing data volumes and improve efficiency, including scaling AWS Glue resources, tuning job configurations, and leveraging advanced features like partitioning and parallelism.
  • Documentation and collaboration: documented the pipeline architecture, configurations, and workflows for future reference, and collaborated with team members and stakeholders to ensure alignment with business requirements and address feedback and changes.
  • Performed data cleaning, feature scaling, featurization, and feature engineering, and deployed the data in Amazon S3 and Athena.
  • Utilized the AWS CLI to automate backups of ephemeral data stores to S3 buckets, and migrated applications from an internal data center to AWS Athena and Glue.
  • Strong experience in implementing data warehouse solutions in Confidential Redshift; worked on various projects to migrate data from on-premise databases to Confidential Redshift, RDS, and S3.
  • Involved in all stages of the Software Development Life Cycle, primarily in database architecture, logical and physical modeling, and data warehouse/ETL development using MS SQL Server 2012/2008R2/2008, Oracle 11g/10g, and ETL solutions/analytics applications development.
  • Experience with Unix/Linux systems, scripting, and building data pipelines. Hands-on experience in writing Python and Bash scripts.
  • Extensive experience in designing and implementing continuous integration, continuous delivery, and continuous deployment through Jenkins.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 and ORC/Parquet/Text files into AWS Redshift.
  • Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Used various Spark transformations and actions for cleansing the input data, and used the Spark application master to monitor Spark jobs and capture their logs.
  • Experience in refactoring the existing Spark batch process for different logs written in Python.
  • Hands-on development in SAS, SQL, Python, and Java with Eclipse to extract patterns from very large datasets and transform data into an informational advantage for decision support.
  • Performed and assisted in the design, development, and testing of predictive analytics models, including large data collection, data organization, text segmentation, categorization, summarization, and topic modeling. Advanced statistical analysis in SAS and predictive solutions.
  • Implemented Big Data tools like Spark using Python, utilizing DataFrames and the Spark SQL API for faster data processing, and worked on an extensible framework for building high-performance batch and interactive data processing applications on Hive.
  • Debugged and maintained automation test scripts in batch mode and implemented a plan for automation scripts based on the Sprint.
  • Developed Oozie workflows to schedule the scripts on a daily basis.
  Environment: Hadoop/Big Data technologies: Spark (Python), Kafka, Spark Streaming, MLlib, Sqoop, HBase, HDFS, MapReduce, Pig, Hive, AWS Glue, Zeppelin (distributions: Databricks, Hortonworks, and Cloudera), Cassandra, Flume, Oozie, JDBC, Apache, Shell Scripting, Pandas, Lambda.
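
  A minimal AWS Glue ETL script sketch following the design/transform/load steps above. The catalog database, table, field mappings, and S3 target are placeholder assumptions, not the actual Panasonic job.

  ```python
  # Minimal AWS Glue job sketch: catalog source -> mapping transform -> curated S3 output.
  # Catalog database/table names, the mapping, and the bucket are placeholder assumptions.
  import sys

  from awsglue.context import GlueContext
  from awsglue.job import Job
  from awsglue.transforms import ApplyMapping
  from awsglue.utils import getResolvedOptions
  from pyspark.context import SparkContext

  args = getResolvedOptions(sys.argv, ["JOB_NAME"])

  sc = SparkContext()
  glue_context = GlueContext(sc)
  job = Job(glue_context)
  job.init(args["JOB_NAME"], args)

  # Source: a table discovered by a Glue crawler over the raw campaign data.
  source = glue_context.create_dynamic_frame.from_catalog(
      database="campaign_raw",            # placeholder catalog database
      table_name="adobe_events",          # placeholder catalog table
  )

  # Transform: rename/cast fields as part of the cleaning and enrichment step.
  mapped = ApplyMapping.apply(
      frame=source,
      mappings=[
          ("event_id", "string", "event_id", "string"),
          ("event_ts", "string", "event_timestamp", "timestamp"),
          ("spend", "string", "spend_usd", "double"),
      ],
  )

  # Load: write curated Parquet back to S3 for downstream Redshift loads.
  glue_context.write_dynamic_frame.from_options(
      frame=mapped,
      connection_type="s3",
      connection_options={"path": "s3://example-curated-bucket/campaign/"},  # placeholder bucket
      format="parquet",
  )

  job.commit()
  ```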

Data Engineer

Florida Blue
October 2017 - December 2018
  Responsibilities:
  • Proficient in designing and creating various data visualization dashboards, worksheets, and analytical reports to help users identify critical KPIs and facilitate strategic planning in the organization, utilizing Tableau visualizations according to end-user requirements.
  • Determined operational objectives by studying business functions, gathering information, and evaluating output requirements and formats.
  • Coordinated with the team and developed a framework to generate daily ad-hoc reports and extracts from enterprise data, automated using Oozie.
  • Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.
  • Worked on cloud deployments using Maven, Docker, and Jenkins.
  • Designed and coordinated with the Data Science team in implementing advanced analytical models in a Hadoop cluster over large datasets.
  • Developed PySpark code for AWS Glue jobs and for EMR.
  • Involved in developing various ETL jobs to load, extract, and map data from flat files and heterogeneous database sources such as Oracle, SQL Server, MySQL, and DB2.
  • Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
  • Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP and BigQuery, and coordinating tasks among the team.
  • Developed and deployed outcomes using Spark and Python code in a Hadoop cluster running on BigQuery and GCP.
  • Created DDLs for tables and executed them to create tables in the warehouse for ETL data loads.
  • Exported the analyzed and processed data to the RDBMS using Sqoop for visualization and for generation of reports for the BI team.
  • Good knowledge of web services using SOAP and REST protocols.
  • Expertise in developing data-driven applications using Python 2.7 and Python 3.0 on the PyCharm and Anaconda Spyder IDEs.
  • Wrote technical documents and mentored the global UNIX team.
  • Proficient in all aspects of the software life cycle (build/release/deploy) and specialized in cloud automation through open-source DevOps tools like Jenkins.
  • Hands-on experience in writing Python and Bash scripts.
  • Dockerized applications by creating Docker images from Dockerfiles.
  • Extensive experience in designing and implementing continuous integration, continuous delivery, and continuous deployment through Jenkins.
  • Performed periodic patch management in Unix/Linux environments.
  • Created reports using Tableau and Power BI to help forecast provider information.
  • Used Postman and SoapUI for REST service testing.
  • Created SQL scripts to insert, update, and delete data in the MS SQL database. Created database tables, wrote stored procedures to update and clean old data, and helped the front-end application developers with their queries.
  • Extracted data from the legacy system and loaded/integrated it into another database through the ETL process.
  • Experience with Azure transformation projects and Azure architecture decision making; architected and implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
  • Worked with Azure SQL Database, Azure Data Lake (ADLS), Azure Data Factory (ADF) V2, Azure SQL Data Warehouse, Azure Service Bus, Azure Key Vault, Azure Analysis Services (AAS), Azure Blob Storage, Azure Search, Azure App Service, and Azure data platform services, including Azure Data Factory (ADF) Integration Runtime (IR), file system data ingestion, and relational data ingestion.
  • Executed SQL queries to test back-end data validation of DB2 database tables based on business requirements.
  • Recommended controls by identifying problems and writing improved procedures for the portal.
  • Designed and created different ETL packages using SSIS and transferred data from an Oracle source to an MS SQL Server destination.
  • Performed performance tuning of SQL queries and stored procedures using SQL Profiler and the Index Tuning Advisor.
  • Created T-SQL queries for schemas, views, stored procedures, triggers, and functions for data migration.
  • Involved in the project from the planning stage to pushing code to production.
  • Scheduled cube processing from staging database tables using SQL Server Agent and SSAS.
  • Translated technical application specifications into functional and nonfunctional business requirements and created user stories based on those requirements in Rally.
  • Created dashboards, worksheets, and storyboards for the stakeholders using Tableau and Excel.
  Environment: GCP, BigQuery, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, VM instances, Cloud SQL, MySQL, Postgres, SQL Server, Salesforce SOQL, Python, Azure Data Factory (ADF), Azure Database Migration Service (DMS), SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), ETL (Extract, Transform, and Load), Business Intelligence (BI), BCP, Spark, Hive, Sqoop, MS SQL Server 2005/2008.
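
  A small sketch of the GCS-to-BigQuery batch ingestion pattern referenced above, using the google-cloud-bigquery client. The project, dataset, table, and bucket names are hypothetical placeholders, not the Florida Blue environment.

  ```python
  # Sketch of a GCS -> BigQuery batch load; all resource names are placeholder assumptions.
  from google.cloud import bigquery

  client = bigquery.Client(project="example-analytics-project")   # placeholder project

  job_config = bigquery.LoadJobConfig(
      source_format=bigquery.SourceFormat.PARQUET,
      write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
  )

  # Kick off the load from a curated Parquet drop in GCS and wait for completion.
  load_job = client.load_table_from_uri(
      "gs://example-curated-bucket/claims/2020-01-01/*.parquet",   # placeholder GCS path
      "example-analytics-project.claims.daily_claims",             # placeholder table id
      job_config=job_config,
  )
  load_job.result()   # raises if the load job fails

  table = client.get_table("example-analytics-project.claims.daily_claims")
  print(f"Loaded {table.num_rows} rows into {table.full_table_id}")
  ```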

Data Engineer

Haley and Aldrich
August 2015 - September 2017
  Responsibilities:
  • Worked on development of data ingestion pipelines using an ETL tool (Talend) and bash scripting with big data technologies including, but not limited to, Hive, Impala, Spark, and Kafka.
  • Experience in developing scalable and secure data pipelines for large datasets.
  • Gathered requirements for ingestion of new data sources, including life cycle, data quality checks, transformations, and metadata enrichment.
  • Supported data quality management by implementing proper data quality checks in data pipelines.
  • Delivered data engineering services such as data exploration, ad-hoc ingestions, and subject-matter expertise to data scientists in using big data technologies.
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib.
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • S3 data lake management: responsible for maintaining and handling data inbound and outbound requests through the big data platform.
  • Working knowledge of cluster security components such as Kerberos, Sentry, and SSL/TLS.
  • Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
  • Knowledge of implementing JILs to automate jobs in the production cluster.
  • Troubleshot users' analysis bugs (JIRA and IRIS tickets).
  • Worked with the SCRUM team to deliver agreed user stories on time for every Sprint.
  • Worked on analyzing and resolving production job failures in several scenarios.
  • Implemented UNIX scripts to define the use-case workflow, process the data files, and automate the jobs.
  Project: MCE
  Description: Build analytics to maintain in-house Salesforce; analyze business processes and identify areas of improvement; maintain the data warehouse using SCD Type 2.
  Languages: Python, SQL
  Framework/Libraries: ADF, Databricks, Synapse, PySpark
  Responsibilities: Developed a pipeline that fetches all the customer data from an SFTP location, then transforms and loads it into a data vault model; PySpark is used for data processing.
  Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.
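
  A hedged sketch of the Kafka streaming ingestion pattern described above, using Spark Structured Streaming to land events in a Parquet data lake. The broker list, topic, event schema, and output/checkpoint paths are placeholder assumptions; the production pipelines also used Talend.

  ```python
  # Sketch of Kafka -> Parquet data-lake ingestion with Spark Structured Streaming.
  # Brokers, topic, schema, and paths are placeholder assumptions.
  from pyspark.sql import SparkSession, functions as F
  from pyspark.sql.types import StringType, StructField, StructType, TimestampType

  spark = SparkSession.builder.appName("kafka-ingest-sketch").getOrCreate()

  event_schema = StructType([
      StructField("event_id", StringType()),
      StructField("source_system", StringType()),
      StructField("event_time", TimestampType()),
      StructField("payload", StringType()),
  ])

  # Read the raw Kafka stream; each record's value is a JSON-encoded event.
  stream = (
      spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
      .option("subscribe", "ingest.events")                            # placeholder topic
      .option("startingOffsets", "latest")
      .load()
  )

  # Parse the JSON payload and keep only well-formed events (basic data quality check).
  events = (
      stream.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
            .select("e.*")
            .filter(F.col("event_id").isNotNull())
  )

  # Land the stream as partitioned Parquet in the S3 data lake with checkpointing.
  query = (
      events.writeStream.format("parquet")
      .option("path", "s3a://example-data-lake/raw/events/")            # placeholder path
      .option("checkpointLocation", "s3a://example-data-lake/_chk/events/")
      .partitionBy("source_system")
      .outputMode("append")
      .trigger(processingTime="1 minute")
      .start()
  )

  query.awaitTermination()
  ```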