Precision or Speed? The DeepSeek Dilemma

Mission Automate
Feb 26

Large Language Models (LLMs) are revolutionizing data analysis workflows, but not all models perform equally. This post explores the tradeoff between response accuracy and speed when comparing DeepSeek V3 and ChatGPT-3.5, revealing surprising insights about their performance characteristics and practical applications in data analysis tasks.

Perplexity Score Comparison

Perplexity scores quantify the uncertainty a model experiences when predicting the next token (word, character, etc.) in a sequence. Lower scores indicate better prediction capabilities and generally correlate with higher response quality.
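To make the definition concrete, here is a minimal sketch of how perplexity can be computed from per-token log-probabilities: it is the exponential of the average negative log-likelihood. The function name and the example log-probabilities are our own illustration, not the measurements reported in this post, and we assume natural-log probabilities.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities for a 4-token sequence (illustrative values only).
example_logprobs = [math.log(0.9), math.log(0.8), math.log(0.95), math.log(0.7)]
print(round(perplexity(example_logprobs), 3))
```

A model that assigned probability 1.0 to every token would score exactly 1, the lowest possible value, which is why scores closer to 1 indicate more confident predictions.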

We tested both models on the following types of input:

General English sentences from everyday language
Ambiguous sentences (e.g., "The chicken is ready to eat")
Domain-specific content with specialized knowledge (mathematical formulas, medical terminology)
Informal, conversational language
Extended multi-turn dialogues

The results revealed a significant performance gap.

Model	Mean Perplexity Score	Response Time (seconds)
ChatGPT-3.5	1.512	0.938
DeepSeek V3	1.088	4.384

DeepSeek V3 demonstrates superior predictive accuracy, with a mean perplexity score approximately 28% lower than that of ChatGPT-3.5. However, this comes at a substantial cost: its responses took roughly 4.7x as long.
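For readers who want to check the arithmetic, the two headline figures can be reproduced from the table above. This is a small sketch; the dictionary layout and variable names are our own.

```python
# Mean perplexity and response time (seconds), copied from the results table.
results = {
    "ChatGPT-3.5": {"perplexity": 1.512, "seconds": 0.938},
    "DeepSeek V3": {"perplexity": 1.088, "seconds": 4.384},
}

gpt, ds = results["ChatGPT-3.5"], results["DeepSeek V3"]

# Relative drop in perplexity, measured against ChatGPT-3.5.
perplexity_reduction = (gpt["perplexity"] - ds["perplexity"]) / gpt["perplexity"]

# How many times longer DeepSeek V3 took to respond.
slowdown = ds["seconds"] / gpt["seconds"]

print(f"Perplexity reduction: {perplexity_reduction:.1%}")  # 28.0%
print(f"Response time ratio: {slowdown:.2f}x")              # 4.67x
```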

Data Analysis Capabilities

To evaluate practical data analysis applications, we tested both models using the top 100 rows of the StubHub price dataset. Our assessment focused on two key areas:

Data preprocessing and cleaning
Insights generation on the dataset

1 - Data Preprocessing and Cleaning

For data preprocessing tasks, we observed the models' approach to handling duplicates.

Here's what ChatGPT-3.5 suggested:

DeepSeek V3's output was similar:

Load the data into an SQL table
Handle missing values
Remove duplicates
Handle outliers
Normalize or scale numerical features
Encode categorical features
Perform feature engineering
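As an illustration of the first three steps, here is a minimal sketch using Python's built-in sqlite3 module. It is not the output of either model. The table name, the rows, and the choice to drop rows with a missing price are assumptions made for the example; only the column names rawPrice and availableTickets come from the dataset.

```python
import sqlite3

# Hypothetical rows for illustration; the values are made up.
rows = [
    (50.0, 2),
    (50.0, 2),    # exact duplicate of the first row
    (None, 4),    # missing price
    (120.0, 6),
]

# Step 1: load the data into an SQL table (in-memory for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (rawPrice REAL, availableTickets INTEGER)")
conn.executemany("INSERT INTO listings VALUES (?, ?)", rows)

# Step 2: handle missing values. Here we simply drop rows with no price;
# imputing a value would be an equally valid choice.
conn.execute("DELETE FROM listings WHERE rawPrice IS NULL")

# Step 3: remove duplicates, keeping the first copy (lowest rowid) of each row.
conn.execute(
    """
    DELETE FROM listings
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM listings GROUP BY rawPrice, availableTickets
    )
    """
)

cleaned = conn.execute(
    "SELECT rawPrice, availableTickets FROM listings ORDER BY rawPrice"
).fetchall()
print(cleaned)
conn.close()
```

The remaining steps (outliers, scaling, encoding, feature engineering) depend heavily on the downstream task, so we leave them out of this sketch.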

In our initial testing, both models generated appropriate queries for data preprocessing and cleaning tasks when prompted with natural language requests. This suggests that the choice between them should be guided by specific project requirements around speed versus depth of analysis.

2 - Insights Generation

Both models were asked to analyze the StubHub dataset and provide insights. The differences in their analytical approaches and output quality revealed interesting contrasts in how these models process and interpret data.

ChatGPT-3.5 Analysis

ChatGPT-3.5 provided a basic overview of the dataset structure and content:

Identified the main columns: rawPrice, anonPriceWithFees, anonRawPrice, inventoryListingScore.seatQualityScore, availableTickets, eventHour, and inventoryListingScore.starRating
Recognized different price values that may represent different pricing strategies
Noted that quality scores and ratings indicate seating quality
Observed ticket prices ranging from $13.0 to $336.0
Identified some rows with very high availableTickets values
Pointed out that events occur at different hours
Suggested potential correlations between prices and seat quality/star ratings

The analysis concluded by recommending further visualization and analysis for deeper insights.

DeepSeek V3 Analysis

DeepSeek V3 provided a significantly more detailed and structured analysis, as elaborated below.

Dataset Overview:

Clear definitions of each column with detailed explanations
Confirmation of exactly 100 rows representing unique ticket listings

Key Observations:

Price Analysis: Prices clustered in specific ranges, showing that fees consistently increase final costs and anonymization slightly alters raw prices.
Seat Quality and Star Ratings: Higher seat quality and ratings dominate, but some lower ratings exist, indicating variability in customer satisfaction.
Available Tickets: Most listings offer a moderate number of tickets, with occasional bulk listings as outliers.
Correlations and Patterns: Higher prices align with better seats and ratings, while lower-rated tickets are cheaper and less available. Pricing patterns suggest tiered ticket categories.
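A claim such as "higher prices align with better seats" can be checked directly with a correlation coefficient. Below is a minimal sketch of the Pearson correlation in plain Python. The helper name and the sample values are hypothetical and are not taken from the StubHub data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical prices and seat quality scores (illustrative values only).
prices = [25.0, 40.0, 60.0, 95.0, 150.0]
seat_quality = [0.2, 0.35, 0.5, 0.7, 0.9]
print(round(pearson(prices, seat_quality), 3))
```

A value near +1 would support the pattern described above, while a value near 0 would suggest the relationship is weaker than the summary implies.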

DeepSeek V3 went beyond basic observations to discover meaningful patterns in the data. It correctly identified price clustering that indicated different ticket tiers, a valuable business insight that reflects the actual StubHub pricing structure. The model recognized common-sense correlations, such as higher seat quality scores being associated with higher prices, but formalized these relationships in a way that could inform business strategy.

Furthermore, DeepSeek V3 proactively suggested specific next steps and use cases, and also highlighted data quality considerations:

Potential use cases: Useful for price prediction, inventory management, and analyzing event timing effects on ticket pricing and availability.
Data quality considerations: No missing values, but outliers exist, and anonymization limits full transparency.
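The outlier observation is easy to verify independently. One common approach, shown here as a sketch, is the interquartile range (IQR) rule, which flags values more than 1.5 IQRs outside the middle half of the data. The function name, the 1.5 multiplier, and the ticket counts are our assumptions, not details from either model's output.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first/third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical availableTickets values with one bulk listing (illustrative only).
tickets = [2, 2, 4, 4, 4, 6, 6, 8, 40]
print(iqr_outliers(tickets))
```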

These insights show how DeepSeek V3 can strengthen data analysis workflows by providing detailed, actionable observations from even a small dataset sample.

Conclusion

The choice between DeepSeek V3 and ChatGPT-3.5 boils down to a classic speed-versus-accuracy trade-off.

⇨ DeepSeek V3 excels at uncovering nuanced patterns and delivering highly accurate insights, particularly with complex real-world datasets, as seen in our test. However, this depth of analysis comes at the cost of processing time.

⇨ Conversely, ChatGPT-3.5 offers rapid responses, making it ideal for quick exploratory analysis or interactive data investigations, though it may sacrifice some of the precision found in DeepSeek V3's output.

Ultimately, the optimal choice hinges on the specific project requirements: DeepSeek V3 for in-depth, time-tolerant analysis, and ChatGPT-3.5 for swift, responsive data exploration.