Data science, roles, and barriers

In keeping with my recent data science theme, I want to point to a nice post by Ben Lorica on how to nurture data scientists. While his post focuses on data scientists in commercial organizations, it makes some points that are very relevant to bioinformaticians.

After working in companies both large and small, it’s clear to me that the strict separation of tasks is the major obstacle faced by data scientists. The most common manifestation is the separation between data analysis and data management. In many large companies, most analysts/statisticians have to wait for data from a designated data warehousing team, and in a lot of cases they wait for data from multiple owners of different data warehouses.

Neil pointed to another bit:

To nurture data scientists, companies need to focus more on culture and organizational structure. Many data workers have enough skills and training to quickly become productive in multiple areas of data intelligence. The problem is that most don’t work in environments that encourage them to become data scientists. They’re stuck in silos and limited to one or two areas of data intelligence. Often, they’re restricted to use tools “approved” by their managers.

When I wrote about biological data scientists, one of the points I was trying to make was that advanced data analysis requires a set of skills that are not easily transferable to bench scientists or biologists who want to retrieve specific, limited pieces of information. While at any given time different people in a company or lab might focus on different aspects of the task, a data scientist needs to be capable of carrying out each of those aspects. Ben covers these different aspects as well:

  • data acquisition: this might entail writing custom parsers and web crawlers, or scripts that target specific web services or API’s for non-traditional data sources.
  • data management: ETL, manipulate, query, and maintain data in databases, key-value stores, or Hadoop.
  • information visualization: uncovering patterns through the use of static visualization toolkits and/or interactive platforms based on Flash or Javascript.
  • analytics: this can range from simple to complex techniques in multivariate statistics, machine-learning, and NLP.
  • insight: extract, summarize, and present key findings to a broad audience.
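As a rough illustration of how these aspects fit together in a bioinformatics setting, here is a minimal Python sketch. It is not from Ben's post: the toy FASTA records, the function names, the in-memory SQLite table, and the choice of GC content as the "analysis" are all assumptions made purely for illustration, and a real workflow would be far larger.

```python
import sqlite3
import statistics

# Hypothetical toy input; real data would come from files, crawlers, or web services.
FASTA = """>seq1
ATGCGC
>seq2
ATATAT
>seq3
GGCCAT
"""


def parse_fasta(text):
    """Data acquisition: a minimal custom parser yielding (name, sequence) pairs."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Analytics: fraction of G and C bases (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def run_pipeline(text):
    # Data management: load the parsed records into a database and query them back.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seqs (name TEXT PRIMARY KEY, seq TEXT)")
    conn.executemany("INSERT INTO seqs VALUES (?, ?)", parse_fasta(text))
    rows = conn.execute("SELECT name, seq FROM seqs ORDER BY name").fetchall()
    conn.close()

    gc = {name: gc_fraction(seq) for name, seq in rows}

    # Information visualization: a crude text bar chart, one bar per sequence.
    for name, value in gc.items():
        print(f"{name} {'#' * round(value * 20)} {value:.2f}")

    # Insight: a short summary that could be presented to a broad audience.
    return {
        "n": len(gc),
        "mean_gc": statistics.mean(gc.values()),
        "max_gc": max(gc, key=gc.get),
    }


if __name__ == "__main__":
    print(run_pipeline(FASTA))
```

The point of the sketch is only that one person can move through every stage without waiting on a separate team for each step, not that any of these particular tools is the right choice.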

We need to rethink how we approach data as bioinformaticians. It will no longer be acceptable to call yourself a bioinformatician if your skills don’t cover at least a good chunk of those described above. But to get there, we need to change our idea of the role of bioinformaticians and of the value of the large volumes of data we are generating.

This entry was posted in Big Data, Informatics, Life Science.

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present