Cleaning data can be hard, the best plan is to avoid it.
There's a delicate balance of planning and action with building software. Plan too much and nothing gets shipped, or things get shipped at a pace that's not enjoyable for anybody involved. Build too quickly and you end up in crushing cycle of fixing mistakes.
One item that should involve more planning is your data model. In many situations these are the hardest things to change. They end up impacting many additional layers and they changing them can end up causing site outages. So plan them up front.
Specifically think about your constraints. What restrictions can you enforce on the data layer (because trust me, leaving it to your software to do the right thing will expose you to hard to find bugs). A very useful constraint is the
unique constraint. Most databases have this feature. An index is created and a field or set of fields must be unique between records.
If you think of a
User table, you'll likely want to have a
unique index on
username. Otherwise how do you log in?
In a more complicated fashion, you might have a table that joins two users together and it doesn't make sense to list a pair twice. For example a
BFFs table that list best friends:
User | Best Friend ================== Nadja | Laszlo Nadja | Nandor (ok to list Nadja twice) Colin | Laszlo (ok to list Laszlo twice) Nadja | Laszlo <-- Don't repeat
What's bad about this?
- It can look sloppy, imagine Nadja's BFFs page and Laszlo was listed twice?
- Your actions may seem broken. If Nadja tries to comment on her friendship, it might not show up correctly because she might be interacting with the wrong entry in the BFF table.
These types of data errors are hard to prevent without a well thought out data model. Even then, things can get through and really prevention at the database layer is the best level of defense. Postgres, MySQL and even MongoDB have some form of Unique indices which can ensure that you don't duplicate yourself.
Cleaning the data can be hard, and on a well trafficked site, that can mean:
- Cleaning the data
- Patching the holes, or adding the right indices
With data, it's always good to model it out ahead of time, and think of constraining things tightly and loosening them as needed.
Share with me your data disasters! firstname.lastname@example.org