Data Processing
It is a long workflow from collecting raw data to presenting the statistical evidence of the final empirical results. The process is very much like cooking. (The key difference between cooking and empirical research is the level of innovation. In everyday cooking no one blames you if you strictly follow a recipe, but research, except for replication studies, demands at least some level of originality.)
1. We come up with an idea of what dish we would like to present.
2. We go to a supermarket to buy the ingredients such as meat and vegetables.
3. We clean and prepare the ingredients.
4. We heat the pot and fry the ingredients until they are edible.
5. We arrange the cooked food on a plate, decorate it a little if needed, and present it at the dining table.
Step 4 is the highlight, as it turns the ingredients into edible food. However, it takes only a small fraction of the time in the whole cooking process. Most of the time is actually spent shopping for the raw materials and preparing them. In this chapter, we talk about these underappreciated but extremely important steps.
I would like to emphasize the replicability of empirical research. Empirical research is never a linear process: we need to constantly come back and revisit the prior steps. We must guarantee that the empirical results are replicable. This is a requirement for any scientific finding. Only if the results are replicable by the authors themselves and by other researchers can we start a meaningful discussion of the research design and the credibility of the empirical findings. Unreplicable research is voodoo.
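As a small illustration, here is a minimal sketch in R of one way to make a simulation-based result replicable: fix the random seed and record the software environment. The output file name is hypothetical and the computation is a toy example, not a recommended specification.

```r
# A minimal sketch of a replicable simulation script (hypothetical example).
set.seed(2024)                          # fix the seed so the random draws can be reproduced
x   <- rnorm(100)                       # simulated data
est <- mean(x)                          # a toy "empirical result"
saveRDS(est, "simulation_result.rds")   # hypothetical output file
sessionInfo()                           # record the R version and loaded packages
```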
Data processing is an art rather than a science. Just like writing, everyone can have his or her own style. However, some styles are inefficient and should be avoided. Here are some practical guidelines that I find useful in my research and would like to share with you.
Data Collection
If the data are collected from somewhere online, keep a log of where and when they were downloaded so that the process can easily be repeated if necessary.
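For illustration, the sketch below downloads a file in R and appends one line to a download log. The URL and file names are hypothetical placeholders, not part of any particular project.

```r
# A minimal sketch of a logged, repeatable download (hypothetical URL and paths).
dir.create("data", showWarnings = FALSE)         # make sure the target folder exists
url      <- "https://example.com/raw_data.csv"   # hypothetical data source
destfile <- "data/raw_data.csv"
download.file(url, destfile, mode = "wb")        # fetch the raw data

# append one line to a download log so the step can be traced and repeated later
log_entry <- paste(Sys.time(), url, destfile, sep = " | ")
cat(log_entry, "\n", file = "download_log.txt", append = TRUE)
```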
Keep the raw data without any editing. If editing is essential, save the edited data into another file. Manual editing should be avoided whenever possible; it is recommended that all editing be done by a script. This is why code-driven software like R or STATA is preferred over menu-driven statistical software like Excel or SPSS.
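A minimal sketch of this principle in R is given below: the raw file is read but never overwritten, all editing happens in the script, and the result is written to a separate file. The column name and cleaning steps are hypothetical.

```r
# A minimal sketch: script-based editing that leaves the raw file untouched.
raw <- read.csv("data/raw_data.csv")            # raw file, never overwritten

clean <- raw
clean$income <- as.numeric(clean$income)        # coerce a miscoded column (hypothetical)
clean <- clean[!is.na(clean$income), ]          # drop rows with missing income

write.csv(clean, "data/clean_data.csv", row.names = FALSE)  # edited copy in a new file
```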
If the data processing takes many steps and a long time to run, you can save intermediate datasets as milestones.
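One way to do this in R is sketched below: a time-consuming step is run only when its milestone file does not yet exist, and later runs simply reload the saved result. The file names and the merge step are hypothetical.

```r
# A minimal sketch of saving an intermediate dataset as a milestone.
milestone <- "data/merged_step2.rds"            # hypothetical milestone file

if (file.exists(milestone)) {
  merged <- readRDS(milestone)                  # resume from the saved milestone
} else {
  clean  <- read.csv("data/clean_data.csv")
  prices <- read.csv("data/prices.csv")         # hypothetical second dataset
  merged <- merge(clean, prices, by = "id")     # the time-consuming step
  saveRDS(merged, milestone)                    # checkpoint for later runs
}
```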
Write comments in each script about what it does, and keep a separate log of the workflow.
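For example, a script can carry a short commented header stating its purpose, inputs, and outputs, and append a line to a separate run log when it finishes. The sketch below assumes hypothetical file and variable names.

```r
## 03_estimate.R  (hypothetical file name)
## Purpose : estimate the baseline regression on the cleaned data
## Input   : data/clean_data.csv
## Output  : output/baseline_fit.rds

dat <- read.csv("data/clean_data.csv")
fit <- lm(income ~ education, data = dat)       # baseline specification (hypothetical variables)

dir.create("output", showWarnings = FALSE)
saveRDS(fit, "output/baseline_fit.rds")

# append to a separate run log kept outside the script
cat(format(Sys.time()), ": 03_estimate.R finished\n",
    file = "run_log.txt", append = TRUE)
```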