Sheetal Musku
Development
South Carolina, United States
Skills
Data Engineering
About
Sheetal Musku's skills align with System Developers and Analysts (Information and Communication Technology), with additional skills associated with Programmers (Information and Communication Technology). Sheetal has 6 years of work experience.
Work Experience
Hadoop Data Engineer
JPMorgan Chase
March 2020 - August 2022
- Gathered business requirements from the modeling team and converted SAS models to PySpark to optimize the code and improve overall execution and performance.
- As part of the model implementation team, analyzed the flow of each SAS model and configured the business context key file, which holds the source data table details required for model execution.
- Developed extraction, transformation, and feature engineering modules in PySpark.
- Developed the scoring module using PySpark and XGBoost and validated the scores against the SAS model by running a compare report (sketched after this entry).
- Wrote the unit test cases for the developed modules.
- Used Bitbucket to share the developed code with the team and track code changes.
- Configured the model execution portal with the model name, start time, and day of the week on which each model should run to produce the final scores.
Environment: Spark Core, PySpark, Python, Hive, Bitbucket
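A minimal sketch of the PySpark + XGBoost scoring and compare step described above, assuming a pre-trained model exported to model.json and illustrative names for the Hive tables, join key, and feature columns (risk_db.model_features, risk_db.sas_scores, account_id, FEATURE_COLS); the actual model configuration is not part of this profile.

import pandas as pd
import xgboost as xgb
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("model_scoring").enableHiveSupport().getOrCreate()

# Broadcast the serialized model once so every executor can load it without re-reading storage.
with open("model.json", "rb") as fh:
    bc_model = spark.sparkContext.broadcast(fh.read())

FEATURE_COLS = ["util_ratio", "delinq_cnt", "tenure_months"]  # placeholder feature names

@pandas_udf(DoubleType())
def score(util_ratio: pd.Series, delinq_cnt: pd.Series, tenure_months: pd.Series) -> pd.Series:
    booster = xgb.Booster()
    booster.load_model(bytearray(bc_model.value))  # load the model from the in-memory buffer
    features = pd.DataFrame({"util_ratio": util_ratio,
                             "delinq_cnt": delinq_cnt,
                             "tenure_months": tenure_months})
    return pd.Series(booster.predict(xgb.DMatrix(features)))

# Score the feature table, then compare against the SAS-produced scores for the compare report.
scored = (spark.table("risk_db.model_features")
          .withColumn("pyspark_score", score(*[F.col(c) for c in FEATURE_COLS])))
compare = (scored.join(spark.table("risk_db.sas_scores"), "account_id")
           .withColumn("abs_diff", F.abs(F.col("pyspark_score") - F.col("sas_score"))))
compare.agg(F.max("abs_diff").alias("max_diff"), F.avg("abs_diff").alias("mean_diff")).show()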
Hadoop Developer
Capital One
November 2018 - January 2020
- Performed ETL operations using Spark, Pig, MapReduce, and Hive.
- Designed, developed, and maintained data processing pipelines using Cloudera technologies such as Apache Hadoop, Apache Spark, Apache Hive, and Python.
- Developed and maintained data ingestion frameworks for efficiently extracting, transforming, and loading data from various sources into the Cloudera platform; proficient with the Cloudera Manager web UI and its services.
- Stored processed data in Hive tables for faster querying.
- Developed a customized application to schedule jobs using Oozie workflows and coordinators, and wrote scripts to automate the Oozie workflows.
- Developed multiple PySpark-based ETL applications to encrypt and decrypt columnar values based on configuration (see the sketch after this entry).
- Worked with input file formats such as XML, JSON, and text; used Avro, Parquet, and ORC with suitable compression techniques to optimize reads and writes on HDFS.
- Created custom keys and custom values while handling data in mappers and reducers based on input data and software requirements, and used text, combine-file, multi-input, and Avro input formats across different applications.
- Regularly served ad hoc analysis requests with priority for day-to-day customer business needs.
- Built Tableau visualization dashboards and reports; developed efficient solutions using Tableau Desktop and Server and maintained Tableau Server and Online configurations.
- Created and maintained data models, data artifacts, and metadata, and produced data as needed through ad hoc queries, reports, and tables.
- Developed aggregate jobs and KPI computation jobs on a regular basis.
- Used both the Spark and MapReduce frameworks on development clusters; migrated MapReduce jobs to Spark and rewrote most of the existing MapReduce jobs in PySpark for better performance.
- Created visualization dashboards in Kibana and set up alerts to notify users about the status of the applications.
- Wrote UNIX/Linux shell scripts that operate on user-defined sets of files to automate recurring tasks.
- Applied performance tuning techniques to improve the performance of existing jobs.
- Collaborated with data architects and data scientists to build data pipelines for consumption by other teams.
Environment: MapReduce, CDH, Cloudera Manager UI, Spark, PySpark, Spark SQL, HDFS, Pig, Hive, Kibana, Elasticsearch, Fluentd, Oozie, XML, JSON, Unix/Shell scripting, Oracle DB
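One way the configuration-driven column encryption mentioned above could be sketched, assuming a JSON config that lists the columns to protect, a Fernet key from the cryptography package standing in for whatever encryption library was actually used, and made-up HDFS paths.

import json
from cryptography.fernet import Fernet
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("column_encryption_etl").getOrCreate()

# Assumed config shape: {"columns": ["ssn", "email"], "key": "<urlsafe base64 Fernet key>"}
with open("encrypt_config.json") as fh:
    config = json.load(fh)
bc_key = spark.sparkContext.broadcast(config["key"].encode())

@F.udf(StringType())
def encrypt(value):
    if value is None:
        return None
    return Fernet(bc_key.value).encrypt(value.encode()).decode()

df = spark.read.json("hdfs:///data/raw/customers")  # input formats varied (XML, JSON, text)
for col_name in config["columns"]:                  # encrypt only the configured columns
    df = df.withColumn(col_name, encrypt(F.col(col_name)))

# Snappy-compressed Parquet keeps reads fast while shrinking the footprint on HDFS.
df.write.mode("overwrite").option("compression", "snappy").parquet("hdfs:///data/curated/customers")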
SQL and Hive Developer
Tech Mahindra
August 2015 - April 2017
- Built and maintained SQL scripts, indexes, and complex queries for data analysis.
- Reviewed stored procedures, packages, and functions for best practices, enhanced features, and better performance.
- Coded and tested SQL queries against persistent tables using inline views, dynamic SQL, and MERGE statements.
- Built and monitored indexes to bring processing times down from minutes to seconds.
- Wrote ad hoc queries to analyze the desired output from the relational model and identified inefficiently written SQL queries for fine tuning.
- Transformed business requirements into technical documentation; involved in analysis, database design, coding, and implementation.
- Loaded gigabytes of data into Oracle tables using external tables, table partitioning, and SQL*Loader, and wrote batch scripts to automate the process.
- Extensively used Oracle PL/SQL to develop complex stored packages, procedures, functions, triggers, text queries, and text indexes to process raw data and prepare it for statistical analysis.
- Used TOAD and PL/SQL Developer for faster application design and development.
- Used Informatica for ETL-based data integration by defining data integration logic, and collaborated across projects to accelerate delivery.
- Used PL/SQL to validate data and populate inventory tables; wrote complex SQL queries using joins and subqueries, and wrote Oracle procedures and functions for managing and maintaining inventory.
- Involved in the design and development of report generation using Oracle; developed the necessary forms and reports based on requirements.
- Created indexes to improve system performance and provided support by troubleshooting and resolving issues.
- Worked on a Hadoop cluster of 56 nodes with 896 terabytes of total capacity.
- Used data ingestion tools such as Sqoop to import/export data between Hadoop and relational databases.
- Supported code/design analysis, strategy development, and project planning; involved in requirement analysis, design, and development.
- Created Hive tables, loaded data, and wrote Hive queries that are converted internally into MapReduce jobs.
- Loaded and transformed large sets of structured and semi-structured data, and loaded data into Hive partitioned tables (see the sketch after this entry).
Environment: Oracle 11g, PL/SQL, Shell scripting, MS SQL Server 2008, Oracle Forms and Reports 6i, TOAD 11.5.0.56, SQL Developer 3.1.07, Informatica, MapReduce, YARN, Hive, Pig, Sqoop, PySpark, Python
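The partitioned Hive loading mentioned above might look roughly like this through Spark SQL, with an invented inventory_db.sales_daily table and a staging path assumed to hold the Sqoop-imported extracts; the actual schemas are not described in this profile.

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive_partition_load")
         .enableHiveSupport().getOrCreate())

# Allow fully dynamic partition inserts.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS inventory_db.sales_daily (
        order_id  BIGINT,
        item_code STRING,
        qty       INT,
        amount    DECIMAL(12,2)
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

# Stage the raw extracts and insert by partition so queries filtering on order_date
# only scan the relevant directories.
staged = spark.read.option("header", True).csv("hdfs:///landing/sales/")
staged.createOrReplaceTempView("staged_sales")

spark.sql("""
    INSERT OVERWRITE TABLE inventory_db.sales_daily PARTITION (order_date)
    SELECT CAST(order_id AS BIGINT), item_code, CAST(qty AS INT),
           CAST(amount AS DECIMAL(12,2)), order_date
    FROM staged_sales
""")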
AWS Data Engineer
JPMorgan Chase
January 2023 - Present
- Develop Python-based applications that consume data from the Streaming Data Platform in real time, process it, and deliver it to external vendors.
- Develop and run serverless Spark-based applications using AWS Lambda and PySpark to compute metrics for various business requirements.
- Develop Python, shell scripting, and Spark-based applications using the PyCharm and Anaconda integrated development environments.
- Wrote a Python script to automate assuming IAM roles and copying S3 files from the cat2 environment to cat1 (sketched after this entry).
- Use Git as the version control tool for maintaining software and Jenkins as the continuous integration and continuous deployment tool for deploying applications to production servers.
- Push AWS EMR and AWS Lambda logs to Elasticsearch using td-agent for log analysis; push application logs from EMR/EC2 instances to Elasticsearch and build dashboards on top of them to monitor application health.
- Integrate the Kibana dashboards with PagerDuty alerts to be notified about job failures via email or Slack channels.
- Analyze failed jobs in the AWS EMR Spark cluster, identify the cause of failures, and improve job performance using PySpark to minimize or avoid failures.
- Analyze, store, and process data captured from different sources using AWS cloud services such as S3, CloudFormation, Lambda, and EMR.
- Tokenize customers' NPI and PCI data arriving from different vendors in a secured environment before sharing it through a less secured environment.
- Wrote Python scripts to move files between OneLake and EMR.
- Well versed in creating Athena tables using a Glue crawler and developing complex SQL queries to expose this data to data scientists and data analysts.
- Knowledgeable in creating and modifying S3 bucket policies, creating IAM roles and policies, and granting users read and write access to S3 buckets.
- Use Airflow extensively for developing, scheduling, and monitoring batch-oriented workflows; closely monitor workflow operations using the Airflow health check and integrated Airflow with Sentry to receive real-time error notifications in production.
- Use ServiceNow to create incidents and change orders, assign issues, and move code changes to production.
- Expertise in creating and managing data integration jobs using AWS Glue: used the Glue console to add, edit, delete, and test connections; used crawlers to populate the AWS Glue Data Catalog with tables; and scheduled ETL jobs that read from and write to the data stores specified in the source and target Data Catalog tables.
- Develop and maintain data warehouse and ETL processes, ensuring data quality and integrity.
- Provide production support: actively involved in application deployments and releases, resolved production issues within the provided SLA, supported software upgrades for applications running in production, and monitored applications to make sure they stay up and running.
- Created dashboards from production logs and set up alerts to notify the corresponding team members about the status of applications in production.
- Use Jenkins pipelines to deploy applications to production; involved in migrating all pipelines from retail Jenkins to Bogie Jenkins.
- Use the AROW and Autosys automation tools to schedule and monitor jobs; automate jobs using shell scripting and schedule them to run at specific times with AROW.
- Use JIRA to update tasks/stories created by the agile lead.
- Test and validate developed applications in the development and QA environments and deploy them to production.
Environment: Spark Core, AWS S3, EMR, Lambda, CloudFormation, CloudWatch, PySpark, Python, Hive, Crontab, Elasticsearch, Kibana, AROW, Autosys, Airflow
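A rough sketch of the role-assumption and S3 copy script referenced above, using boto3; the role ARN, bucket names, and prefix are placeholders, since the real cat1/cat2 resources are not named in this profile.

import boto3

SRC_BUCKET = "cat2-landing-bucket"                             # placeholder cat2 bucket
DST_BUCKET = "cat1-curated-bucket"                             # placeholder cat1 bucket
PREFIX = "daily-extracts/"
CAT1_ROLE_ARN = "arn:aws:iam::111111111111:role/cat1-writer"   # placeholder role ARN

# Assume the cat1 role and build an S3 client from the temporary credentials.
sts = boto3.client("sts")
creds = sts.assume_role(RoleArn=CAT1_ROLE_ARN,
                        RoleSessionName="cat2-to-cat1-copy")["Credentials"]
dst_s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Read each object with the default (cat2) credentials and write it with the cat1 role.
src_s3 = boto3.client("s3")
paginator = src_s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = src_s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
        dst_s3.put_object(Bucket=DST_BUCKET, Key=obj["Key"], Body=body)
        print(f"copied s3://{SRC_BUCKET}/{obj['Key']} -> s3://{DST_BUCKET}/{obj['Key']}")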