Python Medium

N+1 Query in ORM Lookup Loop

Loop over users executes one DB query per iteration instead of batched fetch.

Repository reporting-api

PR Title Add per-user analytics aggregation

Author @aleksandr_b

Files changed 2

Diff size +67 / -14 lines

Changed file

analytics/aggregation.py

from sqlalchemy import create_engine
from models import User, Session, Payment
def get_user_analytics(user_ids: list[int]) -> dict[int, dict]:
    result = {}
    for uid in user_ids:
        sessions = Session.query.filter_by(user_id=uid).all()
        payments = Payment.query.filter_by(user_id=uid).all()
        result[uid] = {
            "session_count": len(sessions),
            "total_revenue": sum(p.amount for p in payments),
        }
    return result
from sqlalchemy import and_
def get_user_analytics(user_ids: list[int]) -> dict[int, dict]:
    if not user_ids:
        return {}
    sessions_by_user = {
        s.user_id: s
        for s in Session.query.filter(Session.user_id.in_(user_ids)).all()
    }
    payments_by_user = {
        p.user_id: p
        for p in Payment.query.filter(Payment.user_id.in_(user_ids)).all()
    }
    result = {}
    for uid in user_ids:
        user_sessions = [s for s in sessions_by_user.values() if s.user_id == uid]
        user_payments = [p for p in payments_by_user.values() if p.user_id == uid]
        result[uid] = {
            "session_count": len(user_sessions),
            "total_revenue": sum(p.amount for p in user_payments),
        }
    return result

Codio Review

Summary

The original function had a textbook N+1 query problem — for each user ID in user_ids, it fires two separate ORM queries (Session.query.filter_by(user_id=uid) and Payment.query.filter_by(user_id=uid)). With 100 user IDs, that's 200 database round-trips before the first result is returned. For a reporting endpoint, this is a significant performance regression waiting to happen. The fixed version correctly batches the fetches using .in_() clauses, reducing the query count from 2N to 2 regardless of input size. However, the Python-side grouping logic in the fixed version has a correctness issue that makes it O(N²) in the wrong place.

Findings

Finding #1 — Python-Side Grouping Is O(N²) Medium

The SQL fix is correct — .in_(user_ids) fetches all sessions and payments in two queries. But the Python grouping logic is inefficient:

user_sessions = [s for s in sessions_by_user.values() if s.user_id == uid]

sessions_by_user is a dict keyed by user_id, so the lookup should be O(1): sessions_by_user[uid] (or handling the list case if multiple sessions exist). Instead, the code iterates over .values() for every user, which is O(total_sessions) per user. With 10,000 sessions and 100 users, that's 1,000,000 iterations instead of 100 direct dict lookups.

If each user can have multiple sessions, the correct structure is defaultdict(list) keyed by user_id, then sessions_by_user[uid] is O(1). If each user has exactly one session, sessions_by_user[uid] is already correct.

Finding #2 — sum(p.amount for p in payments) Throws on None Medium

If any Payment.amount column is NULL in the database, sum(...) silently skips the NULL (SQL aggregate functions ignore NULLs), so this won't crash — but it may undercount revenue if NULL payments exist in production. More importantly, if the ORM maps Payment.amount to a Python None and the model field is typed as Decimal, the sum() will raise a TypeError at runtime when it tries to add Decimal(50) + None. Verify the column constraint — if NULL payments can exist, handle them explicitly with filter()` or coalesce in the SQL query.



        
          
            Finding #3 — No Pagination for Large user_ids Lists
            Low
          
          
            SQLite and PostgreSQL have limits on the number of values you can pass to IN (...) clauses — PostgreSQL's default is around 32KB of query text, which translates to roughly 1,000–2,000 integer IDs per query. If user_ids can be large (e.g., a bulk export for all users), the .in_() query will fail with a "too many arguments" error. Consider chunking the IDs into batches of 1,000 and running multiple queries, then merging the results.



      
        Recommendation
        The SQL-side fix is correct and solves the original N+1. Addressing Finding #1 first — replace the Python iteration with proper dict lookups. This makes the function O(2 queries + N dict lookups) instead of O(2 queries + N×M Python iterations). Finding #2 is a data integrity concern worth a quick check of the Payment model schema. Finding #3 is an edge case for large batch exports — worth noting but not blocking for normal usage.
      

      
        Suggested Fix
        
          analytics/aggregation.py
          from collections import defaultdict
from sqlalchemy import and_
from models import User, Session, Payment

def get_user_analytics(user_ids: list[int]) -> dict[int, dict]:
    if not user_ids:
        return {}

    BATCH_SIZE = 1000
    sessions_by_user = defaultdict(list)
    payments_by_user = defaultdict(list)

    # Batch the IN queries to avoid PostgreSQL argument limits
    for batch in [user_ids[i:i+BATCH_SIZE] for i in range(0, len(user_ids), BATCH_SIZE)]:
        sessions = Session.query.filter(Session.user_id.in_(batch)).all()
        payments = Payment.query.filter(Payment.user_id.in_(batch)).all()
        for s in sessions:
            sessions_by_user[s.user_id].append(s)
        for p in payments:
            payments_by_user[p.user_id].append(p)

    result = {}
    for uid in user_ids:
        user_payments = payments_by_user.get(uid, [])
        total = sum(
            (p.amount or 0) for p in user_payments
        )
        result[uid] = {
            "session_count": len(sessions_by_user.get(uid, [])),
            "total_revenue": total,
        }
    return result



    
      
        
          
        
        
          Why Codio flagged this
          Codio identifies N+1 queries by analyzing ORM patterns inside loops — when a query is constructed inside a for loop over a collection derived from user input, it's a performance issue waiting to explode under load. The original N+1 is obvious. The O(N²) Python grouping in the "fixed" version is the subtler issue Codio catches — the author solved the SQL problem but introduced an iteration-in-iteration that will slow the function proportionally to the number of rows returned. Codio flags both.
        
      
    

    
      
        
          Get Codio reviewing your PRs
          Ship reviews like this on every pull request. No reminders. No missed PRs.
        
        Join the waitlist →
      
    

    
    
      
        
          ← Previous
          Unhandled Error in HTTP Client Retry Logic
        
      
      
        
          Next →
          Standards Violation: Missing Input Validation