Prashanth Parikipandla
Development
Illinois, United States
Skills
Data Engineering
About
Prashanth Parikipandla's skills align with System Developers and Analysts (Information and Communication Technology). Prashanth also has skills associated with Database Specialists (Information and Communication Technology). Prashanth Parikipandla has 9 years of work experience.
Work Experience
Senior Data Engineer
ANTHEM
July 2019 - Present
- Responsibilities:
* Built data pipelines (ELT/ETL scripts) that extract data from different sources (MySQL, AWS S3 files), transform it, and load it into the data warehouse (AWS Redshift).
* Used Agile Scrum methodology (Scrum Alliance) for development.
* Worked extensively with the AWS cloud platform (EC2, S3, EMR, Redshift, Lambda, and Glue).
* Developed and added a few analytical dashboards using Looker.
* Built data pipelines using PySpark on AWS EMR that process data files in S3 and load them into Redshift.
* Set up and configured GCS buckets with appropriate access controls and lifecycle management.
* Stored and retrieved data efficiently using Google Cloud Storage's object storage capabilities.
* Built aggregate and denormalized tables, populated through ETL, to improve Looker analytical dashboard performance and to help data scientists and analysts speed up ML model training and analysis.
* Played a lead role in gathering requirements, analyzing the entire system, and estimating development and testing efforts.
* Developed custom Jenkins jobs/pipelines containing Bash shell scripts that use the AWS CLI to automate infrastructure provisioning.
* Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, uncovering insights into customer usage patterns.
* Estimated cluster size and monitored and troubleshot the Spark Databricks cluster.
* Scheduled all jobs using Airflow scripts written in Python, adding tasks to DAGs and defining dependencies between the tasks.
* Developed Spark code using Python and Spark SQL/Streaming for faster testing and processing of data, and performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
* Ran SQL queries in AWS Athena against the database from AWS Glue.
* Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
* Developed a user-eligibility library in Python to accommodate partner filters and exclude the filtered users from receiving credit products.
* Built data pipelines to aggregate user clickstream session data using the Spark Streaming module, which reads clickstream data from Kinesis streams and stores the aggregate results in S3; the data is eventually loaded into the AWS Redshift warehouse.
* Supported and built the infrastructure for Approval Odds, the core module of Credit Sesame: started with batch ETL, moved to micro-batches, and then converted to real-time predictions.
* Used Jira for ticketing and issue tracking and Jenkins for continuous integration and continuous deployment. Enforced standards and best practices around data catalog and data governance efforts.
* Created numerous ODI interfaces and loaded data into Snowflake.
* Developed SQL and stored procedures in Teradata; loaded data into Snowflake and wrote SnowSQL scripts.
* Used Amazon Redshift to consolidate all data warehouses into a single data warehouse.
* Good understanding of Cassandra architecture, replication strategy, gossip, snitches, etc.
* Designed column families in Cassandra; ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per business requirements.
* Created DataStage jobs using stages such as Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, and Row Generator.
* Good experience with version control tools: Bitbucket, GitHub, and Git.
* Experience with Jira and with the Oozie and Airflow scheduling tools.
* Strong scripting skills in Python, Scala, and UNIX shell.
* Wrote Python and Java APIs for AWS Lambda functions to manage AWS services.
* Used the Spark Cassandra Connector to load data to and from Cassandra.
* Configured Kafka from scratch, including Managers and Brokers.
* Developed AWS Lambda serverless scripts to handle ad hoc requests.
* Performed cost optimization to reduce infrastructure costs.
* Knowledge of and experience with Python NumPy, Pandas, scikit-learn, ONNX, and machine learning.
* Used strong SQL skills to create queries that extract data from multiple sources and build performant datasets.
* Scheduled all jobs using Airflow scripts written in Python, adding different tasks to the DAG and Lambda.
* Added a REST API layer to ML models built using Python and Flask, and deployed the models to an AWS Elastic Beanstalk environment using Docker containers.
* Other activities included supporting and keeping the data pipelines active; working with product managers, analysts, and data scientists and addressing their requests; unit testing; load testing; and SQL optimization.
* Converted Go scripts into Spark jobs that read the necessary fields from Impala and populate them into HBase.
* Performed data manipulation, cleaning, and statistical analysis using SAS (Statistical Analysis System).
* Used Statistica for data exploration, visualization, and predictive modeling.
* Performed statistical tests, regression analysis, and data visualization using SPSS.
* Developed visually appealing charts, graphs, and reports using the visualization features of SAS, Statistica, SPSS, or SAS Enterprise Miner.
* Presented complicated data findings to stakeholders using clear and perceptive graphical representations.
* Developed interactive dashboards and reports to convey data-driven insights to business teams.
* Implemented object-oriented programming (OOP) concepts in Python to create reusable and modular code for data processing pipelines, enhancing code maintainability and readability.
* Utilized Jupyter Notebooks to explore, manipulate, and analyze data using Python and other languages, and to create and share documents that contain live code, equations, visualizations, and narrative text.
* Worked with DBT (Data Build Tool) to develop, test, and maintain data pipelines in AWS. Utilized DBT to manage dependencies, versioning, and modularization of data transformations.
* Ensured metadata management, data lineage, and data governance principles were followed in all data engineering projects. Created entity-relationship diagrams (ERDs) to understand the relationships between data entities and their attributes.
* Built out data lineage by reverse-engineering existing data solutions.
* Developed and implemented data processing pipelines using Databricks and PySpark to extract, transform, and load (ETL) large datasets in various formats, such as CSV, JSON, and Parquet, for ingestion into the AWS data lake.
* Implemented and maintained data governance practices, including data access controls, data masking, and data encryption, to ensure data security and compliance with data privacy regulations such as GDPR and CCPA.
* Leveraged Python and PySpark to clean, preprocess, and transform raw data to ensure data quality and consistency, and performed data validation and data profiling to identify and resolve data issues.
* Ran a proof of concept (POC) to explore AWS Glue capabilities for data cataloging and data integration.
* Expertise in designing, implementing, and optimizing data pipelines and ETL (Extract, Transform, Load) processes using Java.
* Strong understanding of data integration patterns and best practices, and ability to handle data ingestion, cleansing, transformation, and validation using Java-based tools and frameworks.
* Demonstrated experience working with various AWS services, such as AWS Glue, AWS Lambda, AWS EMR, and AWS Redshift, using Java.
* Strong understanding of Java frameworks and libraries commonly used in data engineering, such as Apache Spark, Apache Kafka, and Apache Hadoop.
* Experience in leveraging Java-based distributed computing frameworks, such as Apache Spark, to process large volumes of data in parallel across clusters.
* Proficient in optimizing and tuning Java applications for distributed environments, ensuring high performance and scalability.
* Designed and built an operational data store (ODS) on Oracle Database and Autonomous Data Warehouse Cloud (ADWC) to centralize and manage critical organizational data.
* Developed and maintained ETL processes to populate the ODS, ensuring data integrity and accuracy for reporting and analytics.
* Proficient in utilizing Business Intelligence Publisher (BIP) reports and Business Intelligence Cloud Connector (BICC) jobs to efficiently extract data from Oracle Fusion applications.
* Generated and delivered comprehensive reports and insights using BIP, enabling data-driven decision-making across the organization.
* Hands-on experience with Oracle Data Integrator (ODI), implementing data integration solutions and building data pipelines on Oracle target databases.
* Proficient in Oracle Data Integrator (ODI) for data integration and ETL tasks.
* Extensive experience in data extraction from diverse sources, including databases, flat files, and APIs, using ODI.
* Skilled in data transformation, cleansing, and aggregation through SQL and PL/SQL transformations within ODI.
* Expertise in loading data into Oracle target databases while ensuring data quality and integrity.
* Maintained detailed documentation for ODI workflows, processes, and transformations.
* Strong capabilities in implementing error-handling mechanisms to identify and resolve data integration issues.
* Proven track record in optimizing ODI mappings and workflows for improved performance.
* Proficient in data modeling to maintain data consistency and efficiency in ODI projects.
* Automated data integration tasks, schedules, and monitoring within ODI for enhanced efficiency.
* Utilized SQL and PL/SQL to create, optimize, and automate data pipelines, ensuring seamless data transfer and transformation.
* Employed best practices for data modeling, data warehousing, and ETL to enhance data quality and streamline data processing.
Environment: Groovy, Go, Python, Kafka, Flask, NumPy, Pandas, SQL, MySQL, Cassandra, AWS EMR, Spark, Git, AWS Kinesis, AWS Redshift, AWS EC2, AWS S3, AWS Elastic Beanstalk, AWS Lambda, AWS Data Pipeline, AWS CloudWatch, Airflow, Docker, Shell scripts, Snowflake.
Azure - Sr Data Engineer
American Express
December 2017 - July 2019
- Responsibilities:
* Used Azure Data Factory extensively for ingesting data from disparate source systems.
* Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems. Automated jobs using different triggers (Event, Scheduled, and Tumbling Window) in ADF.
* Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
* Designed and developed user-defined functions, stored procedures, and triggers for Cosmos DB.
* Analyzed the data flow from different sources to the target to provide the corresponding design architecture in the Azure environment.
* Used Azure Data Factory to perform incremental loads from Azure SQL DB to Azure Synapse.
* Provisioned and managed virtual machines using Google Compute Engine.
* Managed large-scale data transfers and migrations to and from Google Cloud Storage.
* Extensively used the SQL Server Import and Export Data tool.
* Set up database users, logins, and permissions.
* Analyzed existing databases, tables, and other objects in preparation for migration to Azure Synapse.
* Loaded data from various sources, such as flat files, into SQL Server databases using SSIS packages.
* Implemented a side-by-side migration of MS SQL Server 2016.
* Handled the daily production server checklist: SQL backups, disk space, job failures, system checks, performance statistics for all servers using a monitoring tool, connectivity checks, and research and resolution of any issues.
* Established a formal EDM and MDM (Master Data Management) program that creates effective engagement between business operations, EDM, the delivery team, and IT.
* Designed and developed SSIS (ETL) packages to validate, extract, transform, and load data from the OLTP system to the data warehouse.
* Designed and implemented tables, functions, stored procedures, and triggers in SQL Server 2016 and wrote the SQL.
* Took initiative and ownership to provide business solutions on time.
* Created high-level technical design documents and application design documents per the requirements, delivering clear, well-communicated, and complete design documents.
* Created DA specs and mapping data flows and provided the details to developers along with HLDs.
* Created build and release definitions for Continuous Integration (CI) and Continuous Deployment (CD).
* Created an Application Interface Document for downstream teams to create a new interface for transferring and receiving files through Azure Data Share.
* Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
* Conducted quarterly data owner meetings to communicate upcoming data governance initiatives, processes, policies, and best practices.
* Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
* Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
* Integrated Azure Active Directory authentication into every Cosmos DB request sent and demoed the feature to stakeholders.
* Spearheaded the migration of critical application data from on-premises SQL Server to Azure Cosmos DB, achieving a 35% increase in data access speed and enhancing global scalability.
* Designed and implemented a multi-model database architecture in Azure Cosmos DB to support key-value, document, and graph data models, enabling flexible and efficient data management across diverse application needs.
* Managed SQL Server databases with extensive use of T-SQL for querying, stored procedures, and triggers, resulting in a 25% improvement in transaction processing times and data integrity.
* Led a project to optimize SQL Server performance through indexing, query optimization, and database tuning, reducing report generation times by over 40% for business-critical operations.
* Developed and deployed scalable workflows using Azure Logic Apps to automate business processes, integrating services such as Azure Cosmos DB and Office 365, which reduced manual processing by 50%.
* Implemented Azure Function Apps for event-driven processing connected to Azure Cosmos DB, enhancing data processing capabilities and enabling real-time data insights for decision-making.
* Orchestrated a seamless migration of multiple terabytes of data from on-premises SQL Server to Azure Cosmos DB, employing Azure Data Factory for data movement and transformation, ensuring zero downtime and data consistency.
* Conducted thorough pre-migration assessments and post-migration optimizations, leveraging Azure Cosmos DB's global distribution to improve application responsiveness by 30% across geographically dispersed user bases.
* Integrated Azure Cosmos DB with Azure Logic Apps and Azure Function Apps to automate data workflows, achieving a highly responsive, serverless architecture that scales automatically with demand.
* Enhanced data security during the migration process by implementing Azure's advanced security features, including data encryption, access control, and threat detection, aligning with compliance requirements.
* Improved performance by optimizing the computing time needed to process streaming data and reduced costs by optimizing cluster run time.
* Performed ongoing monitoring, automation, and refinement of data engineering solutions, and prepared complex SQL views and stored procedures in Azure SQL Data Warehouse and Hyperscale.
* Designed and developed a new solution to process near-real-time (NRT) data using Azure Stream Analytics, Azure Event Hub, and Service Bus Queue.
* Created a linked service to land data from an SFTP location in Azure Data Lake.
* Created numerous pipelines in Azure Data Factory v2 to get data from disparate source systems, using activities such as Move and Transform, Copy, Filter, ForEach, and Databricks.
* Worked with complex SQL, stored procedures, triggers, and packages in large databases across various servers.
* Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification and management, resource issues, monthly one-on-ones, and weekly meetings.
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, Visual Studio 2012/2016, Microsoft SQL Server 2012/2016, SSIS 2012/2016, Teradata Utilities, Windows remote desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Go, Python, Erwin Data Modelling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub.
Data Engineer
Kellogg's, NY
January 2017 - November 2017
- Responsibilities:
* As a Big Data Developer, implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, MongoDB, Hive, Oozie, Flume, Sqoop, and Talend.
* Migrated an in-house database to AWS Cloud and designed, built, and deployed a multitude of applications utilizing the AWS stack (including EC2 and RDS), focusing on high availability and auto-scaling.
* Analyzed the Hadoop cluster using different big data analytic tools, including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark, and Kafka.
* Developed Spark code using Python and Spark SQL/Streaming for faster testing and processing of data, and performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
* Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions, and streamed the data in real time using Spark with Kafka for faster processing.
* Deployed and managed Apache Spark and Hadoop clusters on Google Cloud Dataproc.
* Submitted and monitored Spark and Hadoop jobs for efficient data processing and analysis.
* Installed a 5-node Hortonworks cluster in AWS and Google Cloud on EC2 instances, set up the Hortonworks Data Platform cluster in the cloud, and configured it as a Hadoop platform for running jobs.
* Analyzed data coming from various sources and created meta-files and control files to ingest the data into the Data Lake.
* Developed data pipeline programs with Spark Python APIs, built data aggregations with Hive, and formatted data (JSON) for visualization, e.g., Highcharts views of outliers, data distribution, and correlation/comparison.
* Worked extensively with Python, built the custom ingest framework, and worked on REST APIs using Python.
* Analyzed the SQL scripts, designed the solution to implement them using PySpark, and created custom new columns depending on the use case while ingesting data into the Hadoop lake.
* Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
* Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from SQL into HDFS using Sqoop.
* Installed Hadoop, MapReduce, and HDFS and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
* Imported data from different sources such as HDFS and HBase into Spark RDDs, and configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
* Virtualized servers using Docker for test and dev environment needs, and automated configuration using Docker containers.
* Created Elastic MapReduce (EMR) clusters and configured the data pipeline with EMR clusters for scheduling the task runner and provisioning EC2 instances on both Windows and Linux.
* Converted MapReduce programs into Spark transformations using Spark RDDs in Scala and developed Spark scripts using Scala shell commands as required.
* Performed transformations, cleaning, and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.
* Designed and implemented ETL processes using Talend to load data.
* Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframe, and for loading data into HDFS.
* Created and altered HBase tables on top of data residing in the Data Lake and created external Hive tables on the blobs to expose the data to the Hive Metastore.
* Utilized Agile Scrum methodology to help manage and organize a team of 4 developers with regular code review sessions.
Environment: Hadoop, Data Lake, JavaScript, Python, HDFS, Spark, AWS Redshift, AWS Glue, Lambda, MapReduce, Pig, Hive, Sqoop, Kafka, HBase, Oozie, Flume, Scala, Java, SQL Scripting, Talend, PySpark, Linux Shell Scripting, Kinesis, Docker, Zookeeper, EC2, EMR, S3, Oracle, MySQL.
Software Engineer
Capgemini Indian Pvt
August 2014 - November 2016
- Responsibilities:
* Participated in technical and high-level design review meetings with business testers and business owners and completed Software Development Life Cycle (SDLC) phases of the project, including designing, developing, testing, and deploying applications.
* Used Agile methodology, with hands-on experience in sprints focused on continuous improvement in the development of a product or service.
* Designed and coded various Java/J2EE modules using Spring Boot, Spring MVC, Spring REST, Hibernate, and JPA.
* Involved in requirements gathering, analysis, design, development, and testing of the application using Agile methodology (Scrum).
* Implemented and exposed microservices based on RESTful APIs using Spring Boot.
* Used Maven to define dependencies and build the application, and used JUnit for suite execution and assertions.
* Customized cluster configurations and optimized performance for specific workloads.
* Integrated Google Cloud Dataproc with other GCP services like BigQuery and Cloud Storage for end-to-end data processing pipelines.
* Developed CI/CD services for the internal engineering teams.
* Developed and optimized CI build jobs with Jenkins.
* Modified the Hibernate config.xml to successfully connect to the database.
* Managed code versioning and release versioning through Git and developed use case diagrams and class diagrams.
* Developed the project's REST services to meet web service quality standards.
* Implemented the database layer in MySQL using scripts.
* Developed custom SQL scripts to improve database efficiency, reduce data load time, and enhance performance.
* Deployed various J2EE applications as WAR and EAR archives in production and non-production environments.
* Configured and administered JDBC data sources, connection pools, and multi pools on WebLogic Server.
* Worked extensively with the QA team, coordinating the testing and automation cycle.
* Developed unit tests using JUnit and Mockito and was involved in functional, integration, and performance testing.
Environment: Spring Data JPA, Spring Boot, Microservices, MySQL, Apache Tomcat, REST, XML, Log4j, GitHub, Agile, Windows, Java, RESTful Web services, JSON, Eclipse, JIRA, Maven, J2EE, Spring, JavaScript, Selenium, HTML, CSS, and Bootstrap.