Srilakshmi Cheeti
Development
Texas, United States
Skills
Data Engineering
About
Srilakshmi Cheeti's skills align with System Developers and Analysts (Information and Communication Technology). She also has skills associated with Database Specialists (Information and Communication Technology) and has 7 years of work experience.
Work Experience
AWS Data Engineer
Crescendo Bioscience
June 2022 - Present
- Responsibilities:
  * Collaborated on a Scala code base related to Apache Spark, executing Actions and Transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.
  * Conducted data blending and preparation using Alteryx and SQL for Tableau consumption, publishing data sources to the Tableau Server.
  * Developed Spark applications with PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats.
  * Moved data onto the Hadoop cluster using Sqoop, contributing to efficient data warehousing processes.
  * Developed Angular views to hook up models to the DOM and synchronize data with the server as a single-page application (SPA).
  * Converted the manual reporting system to a fully automated CI/CD data pipeline, ingesting data from various marketing platforms into the AWS S3 data lake.
  * Involved in creating AWS pipelines by extracting customers' Big Data from various data sources into Hadoop HDFS, including data from Excel, flat files, Oracle, SQL Server, Teradata, and server log data.
  * Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
  * Contributed to the development of a test environment on Docker containers and configured Docker containers using Kubernetes.
  * Created action filters, parameters, and calculated sets for preparing Power BI dashboards and worksheets.
  * Developed a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in an HDFS-protected folder.
  * Developed views and templates with Python and Django for a user-friendly website interface.
  * Utilized Python for data processing and for handling data integration between on-prem and cloud databases or data warehouses.
  * Created two-way binding AngularJS components and provided access to the server side through Backbone's API.
  * Set up Infrastructure as Code (IaC) in the AWS cloud platform using CloudFormation templates, integrating appropriate AWS services per business requirements.
  * Utilized AWS EMR for extensive data transformation tasks, using Spark SQL with Scala and Python interfaces to manage Resilient Distributed Datasets (RDDs) and generate output responses.
  * Executed machine learning algorithms in Python on AWS, forecasting user order quantities by leveraging Kinesis Firehose and the S3 data lake.
  * Constructed Extract, Transform, and Load (ETL) pipelines using a blend of Python and Snowflake's SnowSQL, augmenting data warehouse capabilities.
  * Ensured the development process followed SDLC (Software Development Life Cycle) guidelines.
  * Developed ETLs using PySpark, employing both the DataFrame API and the Spark SQL API (a minimal illustrative sketch follows this role).
  * Optimized workflows, scheduled ETL jobs, and utilized Apache Airflow components such as pools, executors, and multi-node functionality.
  * Worked on continuous integration using Jenkins, with SVN as the version control tool.
  * Effectively handled data errors while modifying existing reports and manually creating new reports in Tableau.
  * Proficient in deploying machine learning models to production environments using platforms such as Docker, Kubernetes, or cloud services.
  * Experienced in Amazon AWS services such as EMR, EC2, S3, CloudFormation, and Redshift for fast and efficient processing of Big Data.
  * Created and maintained a data lake across AWS using S3, Lambda, Glue, DynamoDB, Elasticsearch, CloudWatch, and Athena.
  * Used Informatica PowerCenter to extract, transform, and load data into the Netezza data warehouse from various sources such as Oracle and flat files.
  * Explored Spark to improve performance and optimize existing Hadoop algorithms using Spark Context, Spark SQL, DataFrames, and Spark on YARN.
  * Deployed and managed Big Data Hadoop applications on AWS, demonstrating expertise in event-driven and scheduled AWS Lambda functions to trigger various AWS resources.
  * Implemented GitHub Actions workflows to deploy Terraform templates into AWS.
  * Developed robust data ingestion pipelines leveraging AWS Lambda, AWS Glue, and Step Functions, integrating cleansing and transformation steps to enhance data processing efficiency.
  * Executed high-performance data ingestion pipelines using Apache Spark on AWS Databricks, streamlining the processing and analysis of data from diverse sources.
  * Exported and imported data from file formats and Excel spreadsheets by creating SSIS packages, and participated in the development process following the Agile methodology.
  * Used Sqoop to ingest DB2, Teradata, and SQL Server data into the Hadoop layer; based on requirements, loaded the data into Hive partitioned tables.
  * Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.
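The ETL bullets above describe PySpark work that combines the DataFrame API and Spark SQL against an S3 data lake. Below is a minimal sketch of that pattern, assuming a Spark cluster with S3 access; the bucket names, paths, and column names are hypothetical placeholders rather than details of the actual projects.

    # Illustrative PySpark ETL: read raw CSV from S3, transform with the DataFrame
    # API and Spark SQL, and write partitioned Parquet back to the data lake.
    # Bucket names, paths, and columns are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("marketing-ingest").getOrCreate()

    # Read raw marketing extracts landed in S3 (hypothetical path)
    raw = (spark.read
           .option("header", "true")
           .csv("s3://example-raw-bucket/marketing/*.csv"))

    # DataFrame API transformations: type casting, deduplication, derived columns
    clean = (raw
             .withColumn("event_ts", F.to_timestamp("event_ts"))
             .dropDuplicates(["event_id"])
             .withColumn("event_date", F.to_date("event_ts")))

    # The same data exposed to Spark SQL for an aggregation step
    clean.createOrReplaceTempView("events")
    daily = spark.sql("""
        SELECT event_date, channel, COUNT(*) AS event_count
        FROM events
        GROUP BY event_date, channel
    """)

    # Write curated output back to the S3 data lake, partitioned by date
    (daily.write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("s3://example-curated-bucket/marketing_daily/"))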
AWS Data Engineer
First Nationwide Bank
August 2019 - May 2022
- Responsibilities:
  * Developed and deployed a Spark application using PySpark to compute a popularity score for all content using an algorithm, loading the data into Elasticsearch for the app content management team to consume.
  * Used Python and shell scripts to automate running the model on new data as required and saved the results to final Phoenix tables.
  * Worked extensively on building and automating data ingestion pipelines, moving terabytes of data from existing data warehouses to the cloud.
  * Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  * Developed PySpark-based pipelines using Spark DataFrame operations to load data into the EDL, using EMR for job execution and AWS S3 as the storage layer.
  * Database development experience using SQL, Spark, and BigQuery, along with a variety of relational and NoSQL-oriented data stores such as Hadoop, MongoDB, and Cassandra.
  * Integrated data from multiple sources into the AWS data lake, performing validation and ETL to load data into Redshift.
  * Utilized Power BI and SSRS to produce parameter-driven, matrix, sub-report, drill-down, and drill-through reports and dashboards; integrated report hyperlink functionality to access external applications and made dashboards available in web clients and mobile apps.
  * Scheduled Airflow DAGs to run multiple Hive and Pig jobs, which run independently based on time and data availability (a minimal DAG sketch follows this role).
  * Created and maintained data warehouses, databases, tables, SQL queries, and ingestion pipelines to power reports (Tableau), dashboards, predictive models, and downstream analysis.
  * Responsible for maintaining and tuning existing cubes using SSAS and Power BI.
  * Worked on the full spectrum of data engineering pipelines: data ingestion, data transformation, and data analysis/consumption.
  * Developed AWS Lambdas using Python and Step Functions to orchestrate data pipelines.
  * Hands-on experience with the Hadoop ecosystem (HDFS, MapReduce, YARN, Hive, Pig, Impala, Spark, Kafka).
  * Created an AWS Lambda pipeline to migrate microservices from MuleSoft API Gateway to AWS API Gateway.
  * Extracted data from Oracle Financials and the Redshift database; created Glue jobs in AWS and loaded incremental data into the S3 staging and persistence areas.
  * Migrated an in-house database to the AWS Cloud and designed, built, and deployed a multitude of applications utilizing the AWS stack (including S3, EC2, RDS, Redshift, and Athena), focusing on high availability and auto-scaling.
  * Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis; used Kafka Streams and configured Spark Streaming to consume the information and store it in HDFS.
  * Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch, and used AWS Glue for data transformation, validation, and cleansing.
  * Worked on improving the performance and optimization of existing Hadoop algorithms using Spark Context, Spark SQL, DataFrames, pair RDDs, and YARN.
  * Participated in data modelling discussions and influenced the data architecture to ensure the best performance for solutions required by various teams.
  * Implemented data ingestion and handled clusters for real-time processing using Kafka.
  * Developed Spark applications using Spark SQL in EMR for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  * Involved in all steps and scope of the project's reference data approach to MDM; created a data dictionary and mappings from sources to the target in MDM data.
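Several bullets above mention scheduling dependent Hive and Pig jobs with Airflow DAGs. The sketch below shows what such a DAG can look like, assuming Airflow 2.x; the DAG id, schedule, file paths, and Hive script name are hypothetical placeholders, not details of the actual pipelines.

    # Illustrative Airflow DAG: ingest the day's raw files, then run a Hive
    # transformation once ingestion succeeds. All identifiers are placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="hive_daily_load",
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 2 * * *",   # run daily at 02:00
        catchup=False,
    ) as dag:
        # Land the previous day's files on HDFS before transforming them
        ingest = BashOperator(
            task_id="ingest_raw_files",
            bash_command="hdfs dfs -put /staging/raw/{{ ds }} /data/raw/{{ ds }}",
        )

        # Run the Hive transformation only after ingestion has finished
        transform = BashOperator(
            task_id="run_hive_transform",
            bash_command="hive -f /opt/etl/transform_daily.hql -hivevar run_date={{ ds }}",
        )

        ingest >> transform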
Data Engineer
Optum
January 2017 - July 2019
- Responsibilities:
  * Wrote Hive queries for data analysis to meet business requirements.
  * Engineered a robust data pipeline using Spark, Hive, and HBase for seamless data ingestion into the Hadoop cluster, ensuring efficient analysis.
  * Developed Pig Latin scripts to extract and load data from web server output files into HDFS, enhancing data processing capabilities.
  * Designed and implemented reliable data transformations for purposes such as reporting, growth analysis, and multi-dimensional analysis.
  * Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala, contributing to improved performance.
  * Implemented performance optimization techniques, including distributed cache usage, partitioning and bucketing in Hive, and map-side joins (a minimal broadcast-join sketch follows this role).
  * Hands-on experience with Big Data technologies on AWS (S3, Glue, EMR, Redshift), contributing to effective data handling.
  * Expertise in designing, developing, and maintaining software solutions within the Hadoop cluster.
  * Implemented ZooKeeper for concurrent access to Hive tables with shared and exclusive locking.
  * Utilized Sqoop for importing/exporting data from Oracle and PostgreSQL into HDFS for analysis purposes.
  * Conducted quantitative and qualitative data analysis, generating dashboards, reports, and assessments for informed, data-driven decision-making.
  * Leveraged the Spark DataFrame API on the Cloudera platform for analytics on Hive data.
  * Worked on role-based access controls across the interface, logic, and data tiers.
  * Deployed Big Data Hadoop applications on Talend Cloud AWS and Microsoft Azure.
  * Designed batch processing jobs using Apache Spark, achieving a ten-fold increase in speed compared to MapReduce jobs.
  * Implemented Kafka for real-time data ingestion, creating distinct topics for reading data.
  * Orchestrated the extraction, transformation, and loading of data from source systems to Azure Data Storage services, employing Azure Data Factory, Spark SQL, and Azure Data Lake Analytics.
  * Converted HQL into Spark transformations using Spark RDDs with support from Python and Scala.
  * Transferred data from an S3 bucket to the Snowflake data warehouse for report generation.
  * Proficient in handling various file formats, including delimited text files, clickstream log files, Apache log files, Parquet, Avro, JSON, XML, and others.
  * Developed PySpark Streaming jobs by consuming static and streaming data from different sources.
  * Designed and implemented high-performance data ingestion pipelines from multiple sources using Apache Spark and/or Azure Databricks.
  * Extensive experience working on AWS, utilizing EMR for operations, EC2 instances, S3 storage, RDS, and Redshift for analytical operations, and writing data normalization jobs for new data ingested into Redshift, managing multiple terabytes of data.
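One of the optimizations listed above is the map-side join. In Spark this is commonly expressed as a broadcast join, as in the minimal PySpark sketch below; the table paths and join key are hypothetical placeholders, and PySpark DataFrames stand in for the Scala/RDD code used on the project.

    # Illustrative broadcast (map-side) join in PySpark: broadcasting the small
    # dimension table avoids shuffling the large fact table. Paths and columns
    # are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("claims-join").getOrCreate()

    # Large fact table read from partitioned HDFS storage
    claims = spark.read.parquet("hdfs:///data/claims/")

    # Small dimension table that comfortably fits in executor memory
    providers = spark.read.parquet("hdfs:///data/providers/")

    # Broadcasting the small side gives the same effect as a Hive map-side join
    enriched = claims.join(broadcast(providers), on="provider_id", how="left")

    enriched.write.mode("overwrite").parquet("hdfs:///data/claims_enriched/")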
Education