A research paper · 2026 · 12 chapters
Data‑Driven Customer Review Analysis
A machine learning approach for business decision making — predicting review outcomes and turning signal into strategy across 99,441 Brazilian e‑commerce orders.
Best Model: Random Forest (F1 0.7565)
Top Accuracy: 75.40% (Random Forest · 80/20 split)
Strongest Signal: Delivery (delay_days · is_late · delivery_time)
Models Compared: 3 (Logistic · RF · XGBoost)
02 · Abstract
An overview of the study.
This project presents a data‑driven analysis of customer review outcomes using machine learning techniques to support business decision‑making. The study investigates how operational, product, and customer‑related variables influence review ratings and evaluates the predictive performance of three models: Logistic Regression, Random Forest, and XGBoost.
A complete analytical pipeline was implemented: data preprocessing, exploratory data analysis, feature engineering, model training, and performance evaluation. Random Forest achieved the highest F1 score (0.7565) and the top accuracy (75.40%) on an 80/20 split, making it the most reliable of the three models compared. Feature importance analysis identified delivery‑related attributes (delay_days, is_late, delivery_time) as the strongest signal, with pricing characteristics and product features also contributing to predicted customer satisfaction.
The findings provide actionable insights for improving operational efficiency and customer experience, while highlighting opportunities for future research and model enhancement.
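As a minimal sketch of the feature engineering step, the delivery attributes named in this study (delay_days, is_late, delivery_time) could be derived from order timestamps roughly as follows. The study's exact definitions are not given in this overview, so the formulas, the function name delivery_features, and the timestamp format are assumptions for illustration only.

```python
from datetime import datetime

# Assumed timestamp layout; the real dataset's format may differ.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86400.0


def delivery_features(purchased: str, delivered: str, estimated: str) -> dict:
    """Derive illustrative delivery features from three timestamps.

    Assumed definitions (not confirmed by the source):
      delivery_time = days from purchase to actual delivery
      delay_days    = days between estimated and actual delivery
                      (positive means the order arrived late)
      is_late       = 1 if delay_days > 0, else 0
    """
    t_purchased = datetime.strptime(purchased, TIMESTAMP_FORMAT)
    t_delivered = datetime.strptime(delivered, TIMESTAMP_FORMAT)
    t_estimated = datetime.strptime(estimated, TIMESTAMP_FORMAT)

    delivery_time = (t_delivered - t_purchased).total_seconds() / SECONDS_PER_DAY
    delay_days = (t_delivered - t_estimated).total_seconds() / SECONDS_PER_DAY

    return {
        "delivery_time": delivery_time,
        "delay_days": delay_days,
        "is_late": int(delay_days > 0),
    }


# Hypothetical order: delivered 10 days after purchase, 2 days past the estimate.
example = delivery_features(
    "2018-01-01 10:00:00",
    "2018-01-11 10:00:00",
    "2018-01-09 10:00:00",
)
print(example)
```

Under these assumed definitions, an order that arrives before its estimated date has a negative delay_days and is_late equal to 0, so the two features carry related but not identical information.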
04 · Introduction
Why review outcomes matter.
Customer reviews have become a central component of modern digital marketplaces, influencing consumer decisions and shaping business reputation. As competition intensifies across industries, organizations increasingly rely on data‑driven methods to understand customer sentiment, identify operational weaknesses, and guide strategic improvements.
This project applies machine learning techniques to predict customer review ratings and identify the most influential factors driving customer satisfaction. The analytical workflow includes data preprocessing, exploratory data analysis, feature engineering, model training, and performance evaluation. By comparing multiple machine learning models, the study aims to determine which approach provides the most reliable predictive performance for business applications.
Beyond prediction, the project emphasizes translating analytical findings into actionable business insights. The results support evidence‑based decision‑making, enabling organizations to optimize operations, improve customer experience, and enhance overall business performance.
Theoretical foundation
From a data science perspective, predicting customer reviews is a classification problem: the goal is to predict a discrete outcome from measurable attributes. Machine learning provides a robust framework for modeling such relationships. Logistic Regression offers a linear, easily interpreted baseline, while tree ensembles such as Random Forest and XGBoost can capture nonlinear interactions between variables that a linear model may miss. All three are widely used in predictive analytics because they generalize patterns, handle large datasets, and indicate which features drive predictions, through coefficients or feature importance scores.
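To make the classification framing concrete, the sketch below computes accuracy, precision, recall, and F1 from true and predicted labels. It assumes a two‑class target (for example, a positive versus a non‑positive review), which this overview does not confirm. The function name binary_metrics and the sample labels are hypothetical, and the F1 reported for this study may use a different averaging scheme.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration; not drawn from the study's data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

F1 is often preferred over accuracy for comparing models when classes are imbalanced, because it penalizes a model that scores well simply by favoring the majority class.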
Table of Contents
Five chapters, one investigation.
Ch. 01
Research Questions
Five guiding questions on customer satisfaction drivers, model selection, and business impact.
Ch. 02
Data & Methods
Olist e‑commerce dataset, preprocessing, feature engineering, and exploratory data analysis.
Ch. 03
Models & Results
Logistic Regression, Random Forest, XGBoost — accuracy, precision, recall, F1, and feature importance.
Ch. 04
Conclusion & Strategies
Findings, limitations, and five data‑driven strategies for operational improvement.