Eggs and Baskets: Lessons on Data Foraging
It’s been a (long) while since my inaugural post on my Data Science Fellowship project. This post takes the form of a piece of advice for other soon-to-be data gatherers, and it comes down to this: don’t put all your eggs in one basket.
It sounds cliché, and—in retrospect at least—extremely obvious. But it is an important piece of advice nevertheless. What I’m talking about is the way in which we secure data for research projects.
Nick and I built our proposal around a single database. Before submitting an official request for data, we talked with the folks that ran the database, we solicited advice from others who have worked with text as data, and we thought carefully about what we were requesting. We knew the approximate timeline for receiving a data set, and we worked out a deadline that would allow us enough time to complete our analysis. I felt that we were being thorough and doing the best we could to ensure quick delivery of our data. Originally, the developers at the database told us to expect the data as much as a month in advance of that deadline. That deadline has long come and gone. Recently, we learned that due to some complexities of our request, they may not be able to deliver the data set before our fellowship concludes.
(If I am able to speak with the developers, I’ll update this post at a later date with more details about exactly what made extracting a data set so difficult. Right now, what I’ve come to understand is that the size of our request may have been unprecedented and revealed some problems with running large inquiries.)
My advice is to pursue as many sources as possible until you have a workable data set in your hands. This is especially important if you are working with a strict deadline, as is so often the case for those of us writing dissertations or completing fellowships. I recognize that there may not always be that luxury, but if there is more than one potential data source for your work, try to work with more than one.
I want to emphasize that this post is in no way meant as a criticism of the developers who tried to deliver our data. Sometimes, despite the best of intentions, things don’t work out. Nick and I had great interactions with our contacts there, and we do feel that they did everything in their power to meet our deadline. I would not hesitate to request data from them again in the future. The silver lining is that we do expect to get the data eventually and hope to execute our proposal at that point (though this will be after our institutional support comes to an end). While Nick and I had discussed and prepared for a number of obstacles to our research, failure to gather data was not one of them. You can be sure that the next time I go data gathering, I’ll take more than one basket.