
Tomasz Malesa

Development
Kuyavian-Pomeranian Voivodeship, Poland

Skills

Data Engineering

About

Tomasz Malesa's skills align with Database Specialists (Information and Communication Technology). Tomasz also has skills associated with System Developers and Analysts (Information and Communication Technology). Tomasz Malesa has 9 years of work experience.

Work Experience

Senior Big Data Engineer

NETRONIC Software
April 2020 - May 2024
  • Developed, deployed, and optimized ETL and ELT pipelines on AWS with S3, Glue, DynamoDB, EMR, Kinesis, Lambda, Redshift, and Snowflake.
  • Demonstrated expertise in SQL query optimization, database schema design, and disaster recovery strategies in the Azure cloud environment.
  • Designed, developed, and scaled multiple machine learning enhancements using Python, TensorFlow, PyTorch, Scikit-learn, Kubeflow, and DVC.
  • Developed Airflow DAGs, tasks, operators, and connections with Python and SQL (illustrative sketch below).
  • Developed comprehensive data models and analytics solutions leveraging Databricks, Apache Spark, and MLlib, significantly enhancing predictive analytics and business intelligence initiatives.
  • Utilized the Azure cloud platform to architect end-to-end machine learning pipelines, leveraging services such as Azure Machine Learning, Azure Databricks, and Azure Kubernetes Service.
  • Implemented a data pipeline on top of Snowflake using Python and SQL to automate data preparation for several models, with Pandas used for final data checks.
  • Supported AWS Redshift, Redshift Spectrum, ElasticSearch, Kinesis, S3, EC2, RDS, MySQL, PostgreSQL, Aurora, and CloudWatch; managed Control-M, Spotfire, and SnapLogic ETL.
  • Mentored junior engineers on multiple technologies and processes involving Python, Spark, data warehousing, and SQL.
  • Spearheaded the integration of PySpark with other big data technologies (HDFS, YARN, Hive) and data science tools (Pandas, NumPy, Scikit-learn), enabling seamless data ingestion, processing, and analysis workflows.
  • Reduced wasted resources and improved reliability for web and API applications by containerizing and deploying them with Docker and Kubernetes, using GitLab for CI/CD.
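For illustration only (not taken from the role above): a minimal sketch of an Airflow DAG in Python, assuming Airflow 2.4+; the DAG name and task logic are hypothetical placeholders, and connection and SQL details are omitted.

    # Illustrative sketch: a minimal Airflow 2.4+ DAG with one Python task.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def prepare_data(**context):
        # Placeholder for the data-preparation logic described above.
        print("preparing model input data")

    with DAG(
        dag_id="model_data_prep",        # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        prepare = PythonOperator(
            task_id="prepare_data",
            python_callable=prepare_data,
        )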

Data Engineer

KENTO SYSTEMS
March 2018 - February 2020
  • Prepared ETL processes using Talend Open Studio for Big Data for various tasks, such as merging data from other company databases and a MongoDB database.
  • Created ETL tasks in Informatica to move data from the production systems into PostgreSQL.
  • Created solutions such as loading historical data from on-prem Hive into GCP BigQuery using Scala Spark, Databricks, and BigQuery, and loading SAS data from on-prem systems into BigQuery using PySpark, Databricks, and BigQuery (illustrative sketch below).
  • Developed back-end code for a microservices architecture by building Lambda functions, APIs, and stored procedures on MySQL and MS SQL databases.
  • Deployed Apache Airflow on the client's Azure cloud environment using Docker containers and Azure Database for PostgreSQL.
  • Decreased end-to-end pipeline runtime by up to 30% and memory footprint by 20% across thousands of ML scenarios via dimensionality reduction in Python and Spark.
  • Deployed Hadoop distributions in open-source environments, optimizing integration and performance, while exploring and integrating technologies such as Scala, Kafka, and NoSQL databases to broaden data processing capabilities.
  • Dockerized, deployed, and built CI/CD pipelines for several AI products with Git, CircleCI, and Azure DevOps.
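For illustration only: a minimal PySpark sketch of a Hive-to-BigQuery load in the spirit of the bullet above, assuming the spark-bigquery connector is on the classpath; the table and bucket names are hypothetical placeholders.

    # Illustrative sketch: read from an on-prem Hive metastore, write to BigQuery.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive_to_bigquery")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read the source table registered in Hive.
    df = spark.sql("SELECT * FROM warehouse.orders")    # hypothetical table

    # Write to BigQuery; the connector stages data through a GCS bucket.
    (
        df.write.format("bigquery")
        .option("table", "analytics.orders")            # hypothetical dataset.table
        .option("temporaryGcsBucket", "staging-bucket") # hypothetical bucket
        .mode("overwrite")
        .save()
    )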

Data Engineer

SIMINSIGHTS
May 2015 - March 2018
  • Spearheaded the design and implementation of scalable database solutions using Azure SQL, optimizing data storage and retrieval processes for high-traffic applications and leading to a 30% improvement in performance and significant cost savings.
  • Helped manage the transfer of inherited Talend master data management workflows to modern technologies such as Python, Spark SQL, and AWS, mapping a further 20% of product master brands.
  • Deployed and maintained a Hadoop cluster, adding and removing nodes with cluster monitoring tools like Cloudera Manager, configuring NameNode high availability, and keeping track of all running Hadoop jobs.
  • Built data pipelines using Python libraries like Pandas and PySpark, connected them to third-party data providers, and loaded the results into Snowflake; deployed the DAGs to Apache Airflow.
  • Implemented data pipelines in Spark running on EMR, scheduled with Airflow.
  • Built a REST API for transliterating from English to Georgian using Flask and SQL (illustrative sketch below).
  • Refactored the existing Python data handling script to work within the Airflow structure and scheduler.
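For illustration only: a minimal Flask endpoint in the spirit of the transliteration API mentioned above; the route name and the tiny character mapping are hypothetical stand-ins, not the actual rules or schema used.

    # Illustrative sketch: a small Flask service exposing one transliteration route.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical partial Latin-to-Georgian mapping for illustration.
    TRANSLIT = {"a": "ა", "b": "ბ", "g": "გ", "d": "დ"}

    @app.route("/transliterate", methods=["POST"])
    def transliterate():
        text = request.get_json(force=True).get("text", "")
        result = "".join(TRANSLIT.get(ch, ch) for ch in text.lower())
        return jsonify({"input": text, "output": result})

    if __name__ == "__main__":
        app.run(debug=True)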

Education

University of Southern California

Bachelor of Science in Computer Science
September 2011 - May 2015