Shiromani Neupane
Development
MD, United States
Skills
Data Engineering
About
Shiromani Neupane's skills align with System Developers and Analysts (Information and Communication Technology). Shiromani also has skills associated with Programmers (Information and Communication Technology). Shiromani Neupane has 9 years of work experience.
Work Experience
Hadoop/Spark Developer (Big Data Engineer)
Comcast
October 2021 - Present
RESPONSIBILITIES:
* Developed data pipelines using Spark to ingest customer behavioral data and financial histories into HDFS for analysis (a PySpark sketch of this ingestion pattern follows this list).
* Developed robust data pipelines capable of seamlessly ingesting diverse datasets from various sources, ensuring the efficient transfer and transformation of data to the specified destination.
* Implemented authentication and authorization mechanisms using OAuth, JWT, or other security protocols to secure API endpoints and protect sensitive data.
* Skilled in Java programming, with a strong background in developing and maintaining Java-based applications, providing expertise in building robust and scalable software solutions.
* Proficient in the Java programming language, with a deep understanding of its core concepts, data structures, and design patterns, leveraging Java to develop scalable, high-performance applications.
* Extracted data from Salesforce and loaded it into Azure Blob Storage using Azure Data Factory with an SFTP server.
* Implemented scheduled and custom event triggers for deploying Azure Synapse pipelines through Azure DevOps.
* Provided support for Azure databases and DevOps infrastructure as code.
* Skilled in distributed computing and parallel processing techniques, utilizing MapReduce and Spark to process large volumes of data efficiently across distributed clusters.
* Extensively experienced with Azure Synapse Analytics, encompassing data warehousing, data integration, and advanced analytics.
* Developed comprehensive API documentation, including usage guidelines, request/response formats, and error handling procedures, to facilitate integration by external developers and stakeholders.
* Troubleshot and debugged API issues reported by users, identifying root causes and implementing timely resolutions to minimize downtime and disruption to business operations.
* Stayed current with industry trends and emerging technologies in API development, continuously seeking opportunities to enhance skills and improve development processes.
* Strong background in data engineering principles, including data modeling, ETL (Extract, Transform, Load) processes, and data warehousing concepts, to support analytics and reporting requirements.
* Hands-on experience with Azure DevOps for CI/CD (Continuous Integration/Continuous Deployment) pipelines, enabling automated deployment and management of big data solutions for agility and reliability.
* Familiarity with Azure security and compliance standards, implementing data encryption, access controls, and monitoring solutions to ensure data privacy and regulatory compliance.
* Effective communicator and collaborator, with a proven track record of working closely with cross-functional teams to understand business requirements, design scalable data architectures, and deliver innovative solutions that drive business outcomes.
* Designed and implemented robust end-to-end data pipelines, incorporating Hadoop, Hive, and Spark technologies to streamline data processing workflows.
* Worked on Hive by creating external and internal tables, loading them with data, and writing Hive queries.
* Utilized Kafka for real-time data streaming, enhancing the organization's ability to respond promptly to critical business events.
* Integrated Teradata into the data architecture, optimizing data warehousing and improving query performance.
* Responsible for translating business and data requirements into logical data models in support of enterprise data models, ODS, OLAP, OLTP, and operational data structures.
* Implemented and managed job scheduling using Autosys, ensuring timely execution of critical data processes and minimizing downtime.
* Collaborated with cross-functional teams to analyze business requirements, design effective data solutions, and ensure the integrity of data throughout the pipeline.
* Optimized Hive queries for enhanced performance, enabling faster data retrieval and analysis.
* Performed analysis of the client requirements based on the developed detailed design documents.
* Created logical and physical data models and reviewed these models with the business team and data architecture team.
* Developed data pipelines ingesting data from different sources into the desired destination.
* Developed Sqoop scripts to enable interaction between Pig and Oracle.
* Wrote script files for processing data and loading it to HDFS; worked extensively with Sqoop for importing data from Oracle.
* Experienced with Databricks Delta Lake for managing and processing large-scale data pipelines, bringing reliability, performance, and data integrity to big data workloads.
* Utilized Apache Big Data ecosystem tools like HDFS, Hive, and Pig for large dataset analysis.
* Developed Pig and Hive UDFs to analyze complex data and find specific user behavior.
* Created a Data Lake as a data management platform for Hadoop.
* Used Pig as an ETL tool to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
* Worked on Ab Initio for its parallel processing capabilities, scalability, and flexibility in handling large volumes of data in complex computing environments.
* Migrated ETL jobs to Pig scripts to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
* Optimized data storage and processing infrastructure for performance and scalability.
* Kept up to date with emerging trends and advancements in Cloudera and CDP technologies.
* Provided technical guidance and mentorship to junior data engineers.
* Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
* Experienced in using Pig for data cleansing; developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
* Maintained and monitored clusters; loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
* Maintained MapReduce jobs supporting the MapR Data Lake cluster.
* Created HBase tables to store data from various sources.
* Built Spark applications using PySpark and the Python programming language for data engineering in the Spark framework; used PyCharm IDE for development.
* Developed Apache Spark jobs using Scala in the test environment for faster data processing.
* Worked on Kafka streaming and built the Kafka cluster setup required for the environment (a Structured Streaming sketch follows the environment line below).
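For illustration only, a minimal PySpark sketch of the Spark-to-HDFS ingestion pattern named in the first bullet above. The landing path, column names (customer_id, event_ts), and the analytics.customer_behavior Hive table are hypothetical placeholders, not details of the production pipeline.

```python
# Sketch of a batch ingestion job: land raw extracts in HDFS as a curated Hive table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("customer-behavior-ingest")
    .enableHiveSupport()  # assumes the cluster exposes a Hive metastore
    .getOrCreate()
)

# Read a raw landing-zone extract (hypothetical path and layout).
raw = (
    spark.read
    .option("header", "true")
    .csv("hdfs:///landing/customer_behavior/")
)

# Light standardization before persisting for downstream analysis.
curated = (
    raw.withColumn("event_ts", F.to_timestamp("event_ts"))
       .withColumn("ingest_date", F.current_date())
       .dropDuplicates(["customer_id", "event_ts"])
)

# Persist to HDFS as partitioned Parquet and expose it as a Hive table.
(
    curated.write
    .mode("append")
    .partitionBy("ingest_date")
    .format("parquet")
    .saveAsTable("analytics.customer_behavior")  # hypothetical database/table
)
```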
Environment: Spark, Hive, MapReduce, Hadoop, HDFS, Oracle, HBase, Flume, Pig, Sqoop, Oozie, Python, Azure Data Factory pipelines, Azure Databricks, Azure Synapse, Azure HDInsight, Azure Logic Apps, Azure Functions, Azure DevOps, Synapse notebooks, Azure Machine Learning Studio, DataStage, Netezza, Squirrel, Control-M, StreamSets, Snowflake, AWS, PL/SQL, Git, Erwin, NoSQL, OLAP, OLTP, SSIS, MS Excel, SSRS, Visio, Cloudera, CDP, Databricks, Teradata, Delta Lake, S3, EC2, IAM, Terraform, Athena, Glue.
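For illustration only, a minimal Spark Structured Streaming sketch of the Kafka real-time streaming mentioned in this role. The broker address, topic name, and HDFS paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector package is available on the cluster.

```python
# Sketch of consuming a Kafka topic with Structured Streaming and landing events on HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-event-stream").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker list
    .option("subscribe", "customer-events")             # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
decoded = events.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp").alias("event_time"),
)

query = (
    decoded.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/customer_events/")          # placeholder sink
    .option("checkpointLocation", "hdfs:///checkpoints/customer_events/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```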
Sr Hadoop Developer
Verizon Media/Yahoo, Dulles, VA
March 2021 - September 2021
RESPONSIBILITIES:
* Worked on the Yahoo Hadoop Grid in both Dev and Prod environments.
* Responsible for building scalable distributed data solutions using Big Data technologies like Apache Hadoop, MapReduce, Shell scripting, Hive, Pig, Oozie, Spark, Spark SQL, PySpark, and Spark with Scala.
* Implemented and managed data pipelines on Cloudera Data Platform (CDP) to facilitate seamless data ingestion, processing, and analysis.
* Orchestrated ETL workflows using Apache NiFi and Apache Airflow, optimizing data flow and ensuring timely delivery to downstream applications.
* Demonstrated proficiency in administering Cloudera Distribution including Apache Hadoop (CDH), configuring clusters, and troubleshooting performance issues for enhanced data processing capabilities.
* Performed analysis of the client requirements based on the developed detailed design documents.
* Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data.
* Successfully performed integration tests on the Screwdriver CI/CD pipeline.
* Worked on unit testing in Java as part of software test development.
* Involved in the data ingestion process through Pig to load data into HDFS from sources.
* Designed and developed end-to-end ETL processing from HDFS to AWS S3, and vice versa (a PySpark sketch of the HDFS-to-S3 leg follows this entry).
* Developed code to perform data extractions from HDFS and load them into the AWS platform using AWS Data Pipeline.
* Worked on performance tuning of MapReduce jobs and SQL queries.
* Designed and developed Big Data analytic solutions on a Hadoop-based platform and engaged clients in technical discussions.
* Worked with Oozie commands to schedule jobs in dev and production environments.
* Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing with Pig.
* Implemented the AWS cloud computing platform using S3, RDS, Athena, EC2, and IAM.
* Responsible for loading and transforming huge sets of structured and semi-structured data using Pig Latin and Hive.
* Created logical and physical data models and reviewed these models with the business team and data architecture team.
* Developed a Spark streaming application to pull data from the cloud into Hive tables.
* Worked on Oozie to schedule and monitor batch jobs from different application groups on multiple servers to reduce manual work.
* Involved in converting MapReduce programs into Spark transformations using the Spark Scala API.
* Developed Spark code using Scala and Spark SQL/Streaming for faster processing of data.
* Developed Spark scripts using Python and Bash shell commands as per requirements.
* Responsible for translating business and data requirements into logical data models in support of enterprise data models, ODS, OLAP, OLTP, and operational data structures.
* Used Git as the version control tool and merged the Hadoop code into the Git branch.
Environment: Java, Hive, MapReduce, Hadoop, HDFS, Oracle, Spark, HBase, Flume, Pig, Sqoop, Oozie, Python, Cloudera, DataStage, Netezza, Squirrel, Control-M, StreamSets, Snowflake, AWS, PL/SQL, Git, Erwin, NoSQL, OLAP, OLTP, SSIS, MS Excel, SSRS, Visio
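For illustration only, a minimal PySpark sketch of the HDFS-to-S3 leg of the ETL described above. The HDFS path, bucket name, partition, and the campaign_id aggregation are hypothetical, and the job assumes the Hadoop S3A connector and credentials are configured.

```python
# Sketch of reading curated data from HDFS and shipping a summary to S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-s3-etl").getOrCreate()

# Read curated data from the grid's HDFS (placeholder path and partition).
daily = spark.read.parquet("hdfs:///warehouse/ad_events/dt=2021-06-01/")

# Minimal reshaping before shipping downstream; real jobs applied Hive/Spark SQL logic.
summary = (
    daily.groupBy("campaign_id")
         .count()
         .withColumnRenamed("count", "event_count")
)

# Write to S3 via the s3a connector (placeholder bucket and prefix).
summary.write.mode("overwrite").parquet(
    "s3a://example-analytics-bucket/ad_event_counts/dt=2021-06-01/"
)
```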
Big Data Engineer
USAA
April 2020 - March 2021
RESPONSIBILITIES:
* Responsible for building scalable distributed data solutions using Big Data technologies like Apache Hadoop, MapReduce, Shell scripting, and Hive.
* Used Agile (SCRUM) methodologies for software development.
* Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
* Involved in the data ingestion process through DataStage to load data into HDFS from sources.
* Designed and developed end-to-end ETL processing from Oracle to AWS using Amazon S3, EMR, and Spark (a PySpark sketch of this extraction pattern follows this entry).
* Developed code to perform data extractions from the Oracle database and load them into the AWS platform using AWS Data Pipeline.
* Worked on performance tuning of DataStage jobs and SQL queries.
* Installed and configured Hadoop ecosystem components like HBase, Flume, Pig, and Sqoop.
* Designed and developed Big Data analytic solutions on a Hadoop-based platform and engaged clients in technical discussions.
* Worked with Sqoop commands to import data from different databases.
* Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing with Pig.
* Implemented the AWS cloud computing platform using S3, RDS, DynamoDB, Redshift, and Python.
* Responsible for loading and transforming huge sets of structured, semi-structured, and unstructured data.
* Extensively involved in writing PL/SQL, stored procedures, functions, and packages.
* Created logical and physical data models using Erwin and reviewed these models with the business team and data architecture team.
* Developed a Spark streaming application to pull data from the cloud into Hive tables.
* Worked on Control-M to schedule and monitor batch jobs from different application groups on multiple servers to reduce manual work.
* Involved in converting MapReduce programs into Spark transformations using the Spark Python API.
* Developed Spark code using Scala and Spark SQL/Streaming for faster processing of data.
* Configured StreamSets to store the converted data in the Oracle database using JDBC drivers.
* Developed Spark scripts using Python and Bash shell commands as per requirements.
* Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from source systems.
* Used Kafka and Storm for real-time data ingestion and processing.
* Responsible for translating business and data requirements into logical data models in support of enterprise data models, ODS, OLAP, OLTP, and operational data structures.
* Created SSIS packages to migrate data from heterogeneous sources such as MS Excel, flat files, and CSV files.
* Developed SQL queries using stored procedures, common table expressions (CTEs), and temporary tables to support SSRS and Power BI reports.
* Designed data marts following Star Schema and Snowflake Schema methodologies, using industry-leading data modeling tools like Erwin.
* Developed the Star Schema/Snowflake Schema for proposed warehouse models to meet the requirements.
* Used Git as the version control tool and merged the Hadoop code into the Git branch.
Environment: Hive, MapReduce, Hadoop, HDFS, Oracle, Spark, HBase, Flume, Pig, Sqoop, Oozie, Python, DataStage, Netezza, Squirrel, Cloudera, Control-M, StreamSets, Snowflake, AWS, PL/SQL, Git, Erwin, NoSQL, OLAP, OLTP, SSIS, MS Excel, SSRS, Visio
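For illustration only, a minimal PySpark sketch of the Oracle-to-S3 extraction pattern referenced above, as it might run on EMR. The JDBC URL, schema/table, credentials, partition column, and bucket are placeholders, and the job assumes the Oracle JDBC driver is on the Spark classpath.

```python
# Sketch of pulling an Oracle table over JDBC and landing it in S3 as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-s3-extract").getOrCreate()

# Read the source table over JDBC (placeholder connection details).
policies = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")  # placeholder DSN
    .option("dbtable", "CLAIMS.POLICY_EVENTS")                      # placeholder table
    .option("user", "etl_user")                                     # placeholder credentials
    .option("password", "********")
    .option("fetchsize", "10000")
    .load()
)

# Land the extract in S3 as partitioned Parquet for downstream Hive/Spark queries.
(
    policies.write
    .mode("overwrite")
    .partitionBy("EVENT_DATE")  # placeholder partition column
    .parquet("s3a://example-datalake-bucket/policy_events/")
)
```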
Hadoop Developer
Centene
January 2017 - March 2020
RESPONSIBILITIES:
* Responsible for understanding the scope of the project and requirements gathering.
* Loaded log data into HDFS using Flume.
* Experienced with Cloudera and CDP technologies, including Cloudera Manager, Cloudera Data Hub, Cloudera Data Warehouse, and related tools.
* Implemented and maintained data governance practices and standards.
* Experienced in designing and implementing end-to-end data pipelines from various data sources into target locations.
* Developed data pipelines using Flume, Sqoop, Pig, and Python MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
* Developed Python scripts to extract data from web server output files and load it into HDFS.
* Developed a Flume ETL job handling data from an HTTP source with HDFS as the sink.
* Involved in HBase setup and storing data into HBase for further analysis.
* Used Pig as an ETL tool to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
* Wrote Apache Pig scripts to process the HDFS data.
* Created Hive tables to store the processed results in a tabular format.
* Developed Sqoop scripts to enable interaction between Pig and Oracle.
* Wrote script files for processing data and loading it to HDFS; worked extensively with Sqoop for importing data from Oracle.
* Utilized Apache Big Data ecosystem tools like HDFS, Hive, and Pig for large dataset analysis.
* Developed Pig and Hive UDFs to analyze complex data and find specific user behavior.
* Created a Data Lake as a data management platform for Hadoop.
* Migrated ETL jobs to Pig scripts to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
* Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
* Experienced in using Pig for data cleansing; developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
* Maintained and monitored clusters; loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
* Worked on Hive by creating external and internal tables, loading them with data, and writing Hive queries.
* Maintained MapReduce jobs supporting the MapR Data Lake cluster.
* Created HBase tables to store data from various sources.
* Built Spark applications using PySpark and the Python programming language for data engineering in the Spark framework; used PyCharm IDE for development.
* Developed Apache Spark jobs using Scala in the test environment for faster data processing.
* Worked on Kafka streaming and built the Kafka cluster setup required for the environment.
* Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing with Pig and Hive (Data Warehouse).
* Worked with various Hadoop file formats, including Text, SequenceFile, RCFile, and ORC.
* Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, and EC2 instances.
* Developed an AWS Lambda function to invoke a Glue job as soon as a new file is available in the inbound S3 bucket (a Lambda sketch follows the environment line below).
* Configured ZooKeeper for cluster coordination services. Actively involved in code review and bug fixing to improve performance.
ENVIRONMENT: Hadoop, Hortonworks, HDFS, Pig, Hive, HBase, ZooKeeper, Flume, Python, Spark, Kafka, Scala, Sqoop, Elasticsearch, Oozie, Java, Cloudera, Oracle, Docker, Kubernetes, AWS, Windows, UNIX Shell Scripting.
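For illustration only, a minimal sketch of the Lambda-to-Glue trigger named above, written against the standard boto3 Glue API. The Glue job name and the --input_path argument key are hypothetical placeholders.

```python
# Sketch of a Lambda handler that starts a Glue job when a new object lands in the inbound S3 bucket.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event; kicks off the downstream Glue job."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Pass the newly arrived file to the Glue job as a job argument.
        response = glue.start_job_run(
            JobName="inbound-file-etl",                        # placeholder Glue job name
            Arguments={"--input_path": f"s3://{bucket}/{key}"},  # placeholder argument key
        )
        print(f"Started Glue run {response['JobRunId']} for s3://{bucket}/{key}")

    return {"status": "ok"}
```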
Big Data Engineer
AHS
January 2015 - December 2016
RESPONSIBILITIES:
* Responsible for building scalable distributed data solutions using Hadoop.
* Job duties included the design and development of various modules in the Hadoop Big Data platform and processing data using MapReduce, Hive, Sqoop, Kafka, and Oozie.
* Developed job processing scripts using Oozie workflows.
* Worked with Apache Hadoop and Spark.
* Used the Spark API over Hortonworks Hadoop YARN to perform data analysis in Hive.
* Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
* Used the DataFrame API to convert distributed collections of data into named columns (a brief sketch follows this entry).
* Involved in Hadoop cluster tasks like commissioning and decommissioning nodes without any effect on running jobs and data.
* Worked extensively with Sqoop for importing metadata from Oracle.
* Worked on AWS, utilizing EC2, IAM, S3 buckets, and Elastic Load Balancers (ELB).
* Performed real-time streaming of data using Spark with Kafka.
* Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.
* Involved in installing, configuring, and managing Hadoop ecosystem components like Hive, Pig, Sqoop, Kafka, and Flume.
* Developed Spark SQL scripts using Scala to perform transformations and actions on RDDs in Spark for faster data processing.
* Assisted in exporting analyzed data to relational databases using Sqoop.
* Wrote Hive queries and UDFs.
* Developed Hive queries to process the data and generate data cubes for visualization.
ENVIRONMENT: MapReduce, Spark, HDFS, Pig, HBase, Oozie, ZooKeeper, Sqoop, Scala, Linux, Kafka, Hadoop, Maven, NoSQL, MySQL, PostgreSQL, Hive, Java, Eclipse, Python, PySpark.
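For illustration only, a minimal PySpark sketch of the DataFrame API usage mentioned above: organizing a distributed collection into named columns and querying it with Spark SQL. The column names and values are made up.

```python
# Sketch of turning an RDD of plain tuples into a named-column DataFrame and querying it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-api-example").getOrCreate()

# A distributed collection of tuples (e.g. parsed records) with no column names.
rdd = spark.sparkContext.parallelize([
    ("ICU", "2016-03-01", 42),
    ("ER",  "2016-03-01", 65),
    ("ICU", "2016-03-02", 38),
])

# Organize the collection into named columns so it can be used through the DataFrame API.
visits = spark.createDataFrame(rdd, ["unit", "visit_date", "patient_count"])

# Register a temporary view and aggregate with Spark SQL.
visits.createOrReplaceTempView("visits")
spark.sql(
    "SELECT unit, SUM(patient_count) AS total_patients FROM visits GROUP BY unit"
).show()
```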