Harshita B
Design
Texas, United States
Skills
Data Engineering
About
Sri Harshita Bokka's skills align with System Developers and Analysts (Information and Communication Technology). Sri also has skills associated with Database Specialists (Information and Communication Technology). Sri Harshita Bokka has 6 years of work experience.
Work Experience
Sr. GCP Data Engineer
AgFirst
May 2021 - Present
Responsibilities:
* Engineered scalable data pipelines using PySpark and GCP's Dataproc and BigQuery, enabling efficient processing of extensive farm-related datasets for financial analysis (see the PySpark sketch after this role).
* Applied SAS programming skills to clean and transform agricultural data, ensuring high-quality input for subsequent analysis and reporting.
* Executed seamless data migrations between platforms using Sqoop, ensuring the secure and accurate transfer of financial data vital to farm operations.
* Utilized Hadoop and Hive to implement robust ETL processes, ensuring the smooth extraction, transformation, and loading of diverse agricultural datasets.
* Designed and optimized data storage solutions on GCS (Google Cloud Storage), ensuring quick and secure access to historical and real-time financial data.
* Developed custom Python scripts for data cleansing and transformation, enhancing the accuracy and reliability of financial data related to agricultural activities.
* Collaborated with cross-functional teams to integrate diverse agricultural data sources, providing a unified view for comprehensive financial analysis.
* Implemented Snowflake as a cloud data warehouse, improving query performance for advanced financial analytics.
* Managed data consistency and integrity using DynamoDB and Oracle Database, ensuring reliable financial data for reporting and analysis.
* Utilized GCP's Dataproc to create and manage scalable, fault-tolerant clusters, enabling parallel processing of large datasets related to farm financial transactions.
* Collaborated with cross-functional teams to design and implement efficient data processing workflows, leveraging GCP services to enhance the scalability and reliability of data processing tasks.
* Leveraged Dataproc and BigQuery to optimize cloud-based data storage, ensuring efficient organization and retrieval of financial data related to farm operations.
* Implemented data partitioning and clustering strategies to enhance query performance and reduce the cost of storing large volumes of agricultural data on Google Cloud Storage.
* Created interactive financial dashboards using Power BI, offering stakeholders intuitive visualizations for improved decision-making.
* Applied Python-based machine learning to enhance predictive modeling of farm-related financial outcomes.
* Utilized SDKs to develop connectors and interfaces, ensuring seamless data flow between different applications and platforms.
* Collaborated with Data Science teams to implement advanced analytics solutions, leveraging expertise in Databricks for enhanced data processing.
* Managed and optimized data storage costs through strategies such as archiving data to Glacier and efficient use of SQL Database.
* Configured and managed EMR clusters for parallel processing, optimizing performance for data-intensive farm-related applications.
* Ensured database security and compliance through the implementation of access controls and regular SQL database audits.
* Contributed to the development of disaster recovery plans, safeguarding critical financial data related to agricultural operations.
* Provided training and mentorship to junior team members, fostering a collaborative and knowledge-sharing culture within the data engineering team.
* Participated in regular code reviews to ensure coding best practices, maintainability, and adherence to established coding standards.
* Documented data engineering processes comprehensively, facilitating knowledge transfer and ensuring the sustainability of data solutions.

Environment: GCP, PySpark, SAS, Hive, Sqoop, Teradata, Dataproc, BigQuery, Hadoop, GCS, Python, Snowflake, DynamoDB, Oracle Database, Power BI, Java, machine learning, SDKs, Dataflow, SQL Database, Databricks.
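A minimal sketch of the kind of GCS-to-BigQuery pipeline described in this role, assuming the spark-bigquery connector is available on the Dataproc cluster; the bucket, project, table, and column names are hypothetical, illustrative placeholders rather than the actual AgFirst implementation.

```python
# Hypothetical PySpark job for Dataproc: cleanse farm transaction records landed in
# GCS and load them into BigQuery. All names below are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("farm-financials-etl").getOrCreate()

# Read raw CSV exports from Google Cloud Storage.
raw = (spark.read
       .option("header", True)
       .csv("gs://example-farm-data/raw/transactions/*.csv"))

# Basic cleansing: drop records without an account id, normalize amounts and dates.
cleaned = (raw
           .filter(F.col("account_id").isNotNull())
           .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
           .withColumn("txn_date", F.to_date("txn_date", "yyyy-MM-dd")))

# Write to BigQuery via the spark-bigquery connector (must be present on the cluster).
(cleaned.write
 .format("bigquery")
 .option("table", "example_project.farm_finance.transactions")
 .option("temporaryGcsBucket", "example-farm-data-tmp")
 .mode("append")
 .save())
```

Such a job would typically be submitted with `gcloud dataproc jobs submit pyspark`; partitioning and clustering of the target BigQuery table (as mentioned above) would be configured on the table itself.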
Azure Data Engineer
Valuefy solutions
February 2019 - December 2020
Responsibilities:
* Orchestrated the seamless migration of data from legacy database systems to Azure databases.
* Collaborated with external team members and stakeholders to assess the implications of their changes, ensuring smooth project releases and minimizing integration issues in the Explore.MS application.
* Conducted in-depth analysis, design, and implementation of modern data solutions using Azure PaaS services to support data visualization, based on a comprehensive understanding of the current production state and its impact on existing business processes.
* Coordinated with external teams and stakeholders to ensure changes were thoroughly understood and integrated smoothly, preventing integration issues in the VL-In-Box application.
* Assumed responsibility for reviewing the VL-In-Box application's test plan and test cases during System Integration and User Acceptance testing phases.
* Executed Extract, Transform, and Load (ETL) processes, extracting data from source systems and storing it in Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics. Data was ingested into one or more Azure services, including Azure Data Lake, Azure Storage, Azure SQL, and Azure Data Warehouse, with further processing in Azure Databricks.
* Designed and implemented migration strategies for traditional systems in Azure, utilizing approaches such as Lift and Shift and Azure Migrate alongside third-party tools.
* Used Azure Synapse to manage processing workloads and deliver data for business intelligence and predictive analytics needs.
* Implemented data warehouse and business intelligence projects using Azure Data Factory.
* Created SQS queues to act as message queues for decoupling and distributing workloads across distributed systems.
* Configured queue attributes, including message retention periods, visibility timeouts, and maximum message sizes, to meet specific application requirements.
* Collaborated with Business Analysts, Users, and Subject Matter Experts (SMEs) to elaborate on requirements and ensure their successful implementation.
* Conceptualized and implemented end-to-end data solutions encompassing storage, integration, processing, and visualization within the Azure ecosystem.
* Developed Azure Data Factory (ADF) pipelines, incorporating Linked Services, Datasets, and Pipelines for data extraction, transformation, and loading from diverse sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools.
* Configured various subscription types, including HTTP/HTTPS endpoints, email addresses, SMS, and other supported protocols.
* Developed data processing applications in Java using frameworks such as Apache Flink, Apache Beam, and Spring Batch.
* Estimated cluster sizes and monitored and troubleshot Spark clusters in Azure Databricks.
* Executed ETL processes using Azure Databricks, migrating on-premises Oracle ETL to Azure Synapse Analytics.
* Developed custom User-Defined Functions (UDFs) in Scala and PySpark to meet specific business requirements (see the UDF sketch after this role).
* Authored JSON scripts to deploy pipelines in Azure Data Factory (ADF), enabling data processing via SQL Activity.
* Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
* Proposed architectures with a focus on cost efficiency within Azure, offering recommendations to right-size data infrastructure.
* Established and maintained Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.

Environment: Azure, Azure SQL, Blob Storage, Java, Azure SQL Data Warehouse, Azure Databricks, PySpark, Oracle, Azure Data Factory (ADF), T-SQL, Spark SQL.
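A minimal sketch of a custom PySpark UDF of the kind mentioned in this role, runnable on an Azure Databricks cluster or any local Spark session; the normalization rule, column names, and sample values are hypothetical examples, not the actual Valuefy business logic.

```python
# Hypothetical PySpark UDF: normalize client account numbers before loading
# downstream (e.g. into Azure SQL Data Warehouse). Names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

def normalize_account(acct: str) -> str:
    """Strip separators and left-pad account numbers to a fixed 12-digit width."""
    if acct is None:
        return None
    return acct.replace("-", "").replace(" ", "").zfill(12)

normalize_account_udf = udf(normalize_account, StringType())

df = spark.createDataFrame(
    [("12-3456",), ("987 654 321",)], ["account_number"])

df.withColumn("account_number_norm",
              normalize_account_udf("account_number")).show(truncate=False)
```

In practice a native Spark SQL expression would be preferred where one exists, since Python UDFs incur serialization overhead; the UDF form is shown because the role explicitly mentions custom UDFs.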
Hadoop Developer
Omics International
October 2017 - January 2019
Responsibilities:
* Orchestrated data ingestion from UNIX file systems to the Hadoop Distributed File System (HDFS), employing contemporary data loading techniques.
* Utilized Sqoop for seamless importing and exporting of data between HDFS, Hive, and external data sources.
* Translated business requirements into comprehensive specifications, aligning with modern project guidelines for program development.
* Designed and implemented procedures addressing complex business challenges, considering hardware and software capabilities, operating constraints, and desired outcomes.
* Conducted extensive analysis of large datasets to identify optimal methods for data aggregation and reporting.
* Responded promptly to ad hoc data requests from internal and external clients, proficiently generating ad hoc reports.
* Led the construction of scalable distributed data solutions using Hadoop, incorporating the latest tools and technologies in the field.
* Played a hands-on role in Extract, Transform, Load (ETL) processes, ensuring efficient data handling across various stages.
* Managed cluster maintenance, including node management, monitoring, troubleshooting, and reviewing data backups and log files in the Hadoop ecosystem.
* Oversaw the extraction of data from diverse sources, executed transformations using Hive and MapReduce, and efficiently loaded data into HDFS (see the MapReduce sketch after this role).
* Conducted in-depth data analysis by running Hive queries and executing Pig scripts to uncover user behavior patterns, such as shopping enthusiasts, travelers, and music lovers.
* Exported insights and patterns derived from the analysis back into Teradata using Sqoop.
* Ensured continuous monitoring and management of the Hadoop cluster through tools such as Cloudera Manager.

Environment: Hive, Pig, Apache Hadoop, Cassandra, Sqoop, Big Data, HBase, ZooKeeper, Cloudera, CentOS, NoSQL, Sencha ExtJS, JavaScript, AJAX, Hibernate, JMS, WebLogic Application Server, Eclipse, Web Services, Azure, Project Server, Unix, Windows.
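A minimal sketch of the kind of MapReduce transformation described in this role, written as a Hadoop Streaming mapper and reducer in Python; the input layout (tab-separated user activity records) and the category labels are assumptions for illustration, not the actual Omics International data.

```python
# Hypothetical Hadoop Streaming job: count activity events per user-behavior
# category (e.g. shopping, travel, music) from tab-separated logs on HDFS.
# The same script serves as mapper ("map" argument) or reducer (any other argument).
import sys

def mapper():
    # Each input line: user_id <TAB> category <TAB> url
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            print(f"{parts[1]}\t1")

def reducer():
    # Streaming sorts mapper output by key, so equal keys arrive contiguously.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key == current_key:
            count += int(value)
        else:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    # "map" selects the mapper; anything else (e.g. "reduce") selects the reducer.
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```

Such a script would be submitted with the standard Hadoop Streaming jar, passing it as both mapper and reducer commands; input and output HDFS paths here are illustrative and would match the cluster layout described above.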