5.08.2014

Data science anti-pattern: The SQLoppelganger

Data scientists, attention! The time has come to call out one of the egregious anti-patterns of data science.

I call it… the SQLoppelganger. (Pronunciation: skwop-ull-gang-ur.)

Definition: A SQLoppelganger is a database query (or other analytics code) that reproduces business logic that already exists somewhere else.


If you’re writing a query that does the same thing as a function that already exists in someone else’s code, you are SQLoppelganging.

If you’re spending time re-creating a data object that was passed to or returned from a function or API, you are SQLoppelganging.

If you’re modeling the internal state of a client or server from incomplete information, you are almost certainly SQLoppelganging.

Perils of the SQLoppelganger

SQLoppelgangers are mischief makers. They sprinkle mistakes and glitches throughout your analysis. Sometimes these glitches lie dormant for weeks or months before germinating and corrupting your data. (“Wait, why does that dashboard say that metric X dropped by 70% last week?”)

SQLoppelgangers feed on wasted time. They get fat by derailing development timelines---sometimes by a few hours, sometimes by many days.

SQLoppelgangers are also emotional pests. Like ghosts and dogs, SQLoppelgangers sense uncertainty and tension. Beware! They will subtly exploit both to upset the harmony of your organization.



By replicating business logic in different parts of the codebase, SQLoppelgangers break the basic rules of object orientation: loose coupling, separation of concerns, and especially Don’t Repeat Yourself. These are cardinal sins of OOP, clear signals that your code is due for a refactor. Code with duplicated logic is counterintuitive, error prone, and difficult to maintain. It’s “brittle”—and unnecessarily so.

On top of that, the verminous SQLoppelganger carries several nasty secondary problems of its own.

For one thing, the code for SQLoppelgangers is often written in different languages (e.g. python on the server/SQL in the data pipeline), making it extremely difficult to verify that it’s working properly. Unit tests are usually impossible—if you had the data to perform a unit test, you wouldn’t be replicating the business logic.

Second, the code replicated by a SQLoppelganger is usually owned by different teams, which makes it harder to sync up on the latest version of the logic. Priorities and timing might not align; it can take a lot of time, friction, and pain to fix the problem. In the meantime, the data team feels blocked by the server team, and the server team doesn’t understand why the data team can’t just wrap up that analysis.

Third, for data products that must be aware of history (e.g. a dashboard comparing year-over-year KPIs), SQLoppelgangers must be maintained indefinitely.

Avoiding SQLoppelgangers isn't a matter of style or taste. It’s just good practice.

In just a moment, we’ll talk about how to squash SQLoppelgangers. First, it’s worth taking a moment understanding where they come from.

The SQLoppelganger’s habitat

Small SQLoppelgangers are part of any normal agile ecosystem. If you’re building new stuff and learning on the fly, there will be times when your data is incomplete and/or awkwardly formatted.

The good news is that—like alligators—small SQLoppelgangers are not dangerous. They can often be banished with a quick “sure, let me add that field to the table/json object/logfile.” Sometimes, for special one-time queries, it’s best just to feed the beast and move on. (“Um, that merge key doesn’t exist, but I can get around it by JOINing through this other table. Give me five more minutes.”)

However, SQLoppelgangers tend to grow over time, until wrestling them can become a nightmare. I’ve seen organizations where data teams spend the majority of their time reverse-engineering non-transparent business logic. (Ask me some time about the Michigan REP.) Huge time sink, huge morale-killer, huge drag on your capacity to turn data into awesome.



In my experience, SQLoppelgangers are especially common in organizations with (1) growing capacity in data and analytics, and (2) long-running production systems. As data workers sift through production data systems, they will inevitably find data structures that don’t fit the needs of analysis and new data products. This is a normal, but frustrating, reflection of the fact that client-to-server queries in production are typically very different from the types of queries used run in batch-mode analytics.

Because of this fact, the same optimizations that served the company very well in the past can end up creating SQLoppelgangers when a data team begins turning that data exhaust into business intelligence and data products. You can think of it as a kind of accidental premature optimization: production systems were usually created and tuned with one audience (external consumers) in mind, and now they’re being called on to serve another audience (internal data teams).

These are solvable problems, but they take smart engineering, and good communication between teams.

Pest control for SQLoppelgangers

Here’s an incomplete list of approaches to taming/exterminating SQLoppelgangers. Each of these could be a blog post in its own right. For now, I’ll just try to convey the gist.

Just because the data exists, don’t assume that it’s queryable. Building queries takes time. Figuring out someone else’s logic takes time. There’s a big difference between “it is technically possible to write a query that does X,” and “an analyst can come up to speed on the business logic, edge cases, and data pipeline; then write and debug a query that does X, using imperfect tools and sometimes-breaky infrastructure, amid many other competing demands, before the deadline on Wednesday.” It’s the difference between theories of frictionless spherical bodies, and the mess and violence of an actual rocket launch. "Theoretically possible" does not imply “actually doable” when deadlines are involved.

Communicate about time sinks. SQLoppelganging often trades 1 unit of software engineering time for 10 units of data science time. From an organizational perspective, those kinds of trades just aren't worth it. Data scientists need to communicate, politely and clearly, about the sources of drag in their workflow. Software engineers need to respect and accommodate that feedback.

Define and document your schemas, including the logic behind them. One of the biggest time sinks of SQLoppelganging is catching engineers to explain the business logic that generates data. For even mildly complicated business logic, it often takes multiple rounds of Q&A to get clear answers for all the edge cases. Schedules never seem to align, so the data scientists can wait, blocked,  for hours or days between cycles. Sometimes data scientists can read the source code, but that can also be very time-consuming and error-prone, especially when the data trail crosses many systems (“That event is created on a user’s iPhone, then it makes a round trip to the server through the Foo API, then the result gets logged from the client through the Bar API, then we put it into kestrel and…”). Good documentation of expected behavior can easily make this process 10 or 20 times faster.

When business logic is uncertain, schemas should stay flexible. It’s rarely worth the trouble to define a rigid schema around uncertain logic that will change in the near future. Instead, use a flexible data object (e.g. json in text blobs, instead of a table columns; or MongoDB instead of MySQL). In your code, use comments that say things like “//??? Not sure which fields will matter for this algorithm. Revisit and refactor once we have time to explore it.”

Don’t cling to DB normalization. DB normalization has its place, but it can sometimes be more trouble than it’s worth. Especially for immutable tables, it’s often better to store a little extra information to make the analysts’ jobs easier.



Log everything. Storage is cheap these days, and only getting cheaper. The more points at which you log user interactions and strands of internal logic, the less likely you are to end up wasting time SQLoppelganging on a deadline.

Log context and consequence together. When logging user behavior, record exactly what the user sees (e.g. the data object for the current app screen, web page, game history, etc.) along with the action the user took. When logging APIs or internal functions, log the input and the output together, in the same object. If this is impractical (e.g. long session histories with many interactions), make it abundantly easy to merge and re-create the context objects for each event. Include the right object keys, with unambiguous names. Record timestamps on both the client and the server.

Empower data scientists to log events. Engineers are often skittish about letting data scientists work on production systems (sometimes with good reason!) Still, if you have a good logging system set up, it can be very constructive to make an exception for writing custom log entries. Branch your staging repo and let the data team add logging events as necessary. Do a quick code review and merge on whatever cadence makes the most sense. This is the code equivalent of giving the data team a read-only replica of the production data system to query. Empowering data scientists to build out the logging system takes a certain level of infrastructure, communication, and trust, but it’s an incredibly effective way to stamp out SQLoppelgangers. It lets the data team fish for themselves, and get more familiar with the production codebase. Once you’ve got a good system in place, brings the engineering and data teams together and helps both get more done faster.

Plan for a round or two of data format iteration. Recording/logging the right data for analysis is often a learning process, which means that you usually don’t get it exactly right on the first try. The best way to guarantee that the right data get collected is to make room to iterate on the format once or twice for unfamiliar/tricky cases. The alternative is to define a schema and make it a part of your unit tests and QA process, but that only works if the business logic and relevant context are already well-defined. Iterating is usually better.



Done
Okay, that’s all I’ve got. I’ve you have anything to add—questions, advice, SQLoppelganger sightings in the wild—feel free to add them in the comments.

4 comments:

  1. The "SQLoppelganger" is the violation of a well known software engineering principle - DRY - Don't Repeat Yourself.

    As such, the most obvious solution to this is to put the common query into a view or a stored procedure that is accessible to all.

    Also:

    "Don’t cling to DB normalization"

    This is almost never the right answer.

    ReplyDelete
    Replies
    1. I agree with your first point. It can be tricky to share code/logic across teams and codebases, but it's usually worth the trouble to do it.

      On DB normalization, I 'm going to have to disagree. One of the main things we've learned from the NoSQL movement is that duplicating data can be very healthy under the right conditions. I'd suggest reading up on materialized views, immutability, and best practices in logging. Picking the right times to denormalize is still more art than science, but I'm convinced that the answer is rarely "never."

      Delete
  2. Some business logic is encapsulated in a form that SQL tools can't use. There may be good reasons for choosing that form, but usually the logic needed can be encapsulated into some form of separate function. Once the needed logic in shareable, use the native host language to update the sql data.

    And many times "Don't cling to DB normalization" is the right answer. When data science seeks discovery, optimizing a one-time query is pointless. If it will run in an hour, why spend four hours to make it run in a minute? duh!

    ReplyDelete