Continuing my recent data science theme: Ben Lorica has a nice post up on how to nurture data scientists. While his post focuses on data scientists in commercial organizations, it makes some points that are very relevant to bioinformaticians.
After working in companies both large and small, it’s clear to me that the strict separation of tasks is the major obstacle faced by data scientists. The most common manifestation is the separation between data analysis and data management. In many large companies, most analysts/statisticians have to wait for data from a designated data warehousing team, and in a lot of cases they wait for data from multiple owners of different data warehouses.
Neil pointed to another bit:
To nurture data scientists, companies need to focus more on culture and organizational structure. Many data workers have enough skills and training to quickly become productive in multiple areas of data intelligence. The problem is that most don’t work in environments that encourage them to become data scientists. They’re stuck in silos and limited to one or two areas of data intelligence. Often, they’re restricted to use tools “approved” by their managers.
When I wrote about the biological data scientist, one of the points I was trying to make was that advanced data analysis requires a set of skills that are not easily transferable to bench scientists or biologists who want to retrieve specific, limited pieces of information. While at any given time different people in a company or lab might focus on different aspects of the task, a data scientist needs to be capable of carrying out several of them. Ben covers these aspects as well:
- data acquisition: this might entail writing custom parsers and web crawlers, or scripts that target specific web services or API’s for non-traditional data sources.
- data management: ETL, manipulate, query, and maintain data in databases, key-value stores, or Hadoop.
- information visualization: uncovering patterns through the use of static visualization toolkits and/or interactive platforms based on Flash or Javascript.
- analytics: this can range from simple to complex techniques in multivariate statistics, machine-learning, and NLP.
- insight: extract, summarize, and present key findings to a broad audience.
We need to rethink how we approach data as bioinformaticians. It will no longer be acceptable to call yourself a bioinformatician if your skills don’t cover at least a good chunk of those described above. But to get there, we need to change our idea of the role of bioinformaticians and of the value of the large volumes of data we are generating.