Categories
Uncategorized

Data science reading list for Tuesday, October 30, 2018: The sexiest job, most in-demand data skills, 4 ways the data scientist has evolved, and the null hypothesis and p-values

Is data science still among the sexiest jobs of the 21st century?

It was in a 2012 Harvard Business Review article that data scientist was declared “the sexiest job of the 21st century”. Is it still true six years later?

I’ll spare you the torment and give you the answer, which (naturally) appears at the end of the article:

The role of data scientists is and will remain a sexy profession for some time, partly due to its relative exclusivity, and the field of data science itself will no doubt remain an exciting space.

You may find the middle of the article a little more useful, as it lists qualities of good data scientists:

A good data scientist should be:

  • Adaptable: Data scientists must be willing to constantly upskill themselves to master advanced machine learning skills such as deep learning. While technical skills are fundamental for data scientists, it’s crucial for them to master communication skills too so they can easily interact with domain experts or business developers. Data scientists will need to develop a better understanding of the overarching business strategy and business challenges in real-world scenarios to create solutions for real problems.
  • Statistics at the heart: Data scientists must have quantitative capabilities to figure out multifaceted trends within a data set that may entail more than one million rows.
  • Detail-oriented: Data often have errors and discrepancies, and data scientists must identify and correct incomplete, incorrect or inaccurate data. It’s critical that data are clean, high-quality and unbiased to ensure the best output upon which to make business decisions.
  • Good programming skills: Programming skills, together with statistics, are critical. For statistical analysis to happen, data scientists need to know programming languages (such as Java, SQL, and Python) to break down the data set in more digestible formats.
  • Business knowledge: While it is important for data scientists to be technically capable, they must also be business savvy and understand the organisation’s business goals and objectives, so they can analyse the data to support business success.

The most in-demand skills for data scientists

Here are the two key graphs from the article:

From the end of the article:

Based on the results of these analyses, here are some general recommendations for current and aspiring data scientists concerned with making themselves widely marketable.

  • Demonstrate you can do data analysis and focus on becoming really skilled at machine learning.
  • Invest in your communication skills. I recommend reading the book Made to Stick to help your ideas have more impact. Also check out the Hemingway Editor app to improve the clarity of your writing.
  • Master a deep learning framework. Being proficient with a deep learning framework is a larger and larger part of being proficient with machine learning. For a comparison of deep learning frameworks in terms of usage, interest, and popularity see my article here.
  • If you are choosing between learning Python and R, choose Python. If you have Python down cold, consider learning R. You’ll definitely be more marketable if you also know R.

Four Ways the Data Scientist Has Evolved in the 21st Century

These four ways are:

1. Data science is more applied than ever. What can be built and fit over a real-life scenario has the dreadful requirement of mattering. Modeling for modeling sake is no longer a thing, and best-fit diagnostics are less important than best-fit for the situation. If a model goes unused, it serves no purpose. We can no longer tolerate or afford the luxury of building models purely for R&D purposes without consideration of utilization.

2. The skill of computer use seems to have taken over the knowledge of applied statistics. Understanding the interior workings of the black box has become less important, unless you are the creator of the black box. Fewer data scientists with truly deep knowledge of statistical methods are kept in the lab creating the black boxes that hopefully get integrated within tools. This is somewhat frustrating for long time data professionals with rigorous statistical background and understanding, but this path may be necessary to truly scale modeling efforts with the volume of data, business questions, and complexities we now must answer.

3. Data scientists are not weird anymore. We’re seen as strategic inputs to the decision-making process, and our craft is becoming much more understood. This trend is evidenced by C-level positions at large companies, vertical alignment and paths for data scientists, and inclusion at the highest levels, as well as the many academic programs and emphasis now available globally. This appreciation and positioning can sometimes make the field appealing for what seasoned data scientists might call the “wrong reasons” such as corporate fame and value. I would argue that we really want professionals in the field with a thirst for the truth – the science should be about empirically answering questions, and powered by truth-seekers at their heart.

4. Data Science is becoming more widely recognized as both art and science. Understanding the importance of the human – machine integration and complementary decision-making skills from each appears to have made its way more squarely into our field of understanding.

Statistical Significance, the Null Hypothesis and P-Values Defined & Explained in One Minute

And finally, some material that’s more than just hand-waving: a quick explanation of what the null hypothesis and p-values are, all done in a minute, courtesy of a One Minute Economics video.
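
If you'd rather see the idea in code, here's a minimal sketch of a p-value calculation. It assumes a made-up coin-flipping experiment (nothing here comes from One Minute Economics): the null hypothesis is that the coin is fair, and the p-value is the probability, under that assumption, of seeing a result at least as unlikely as the one observed. The function name `two_sided_p_value` and the numbers are my own illustration.

```python
from math import comb


def two_sided_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value for a hypothetical coin-flip experiment.

    Null hypothesis: each flip lands heads with probability p_null.
    """
    def pmf(k: int) -> float:
        # Probability of exactly k heads in `flips` flips if the null is true.
        return comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)

    observed = pmf(heads)
    # Add up every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical data: 60 heads in 100 flips of a supposedly fair coin.
p = two_sided_p_value(heads=60, flips=100)
print(f"p-value: {p:.4f}")
```

With these made-up numbers the p-value lands just above the conventional 0.05 cutoff, so you wouldn't reject the "fair coin" null hypothesis at that level.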


Data science reading list for Monday, October 29, 2018: The worst data science article, 5 basic stats concepts you need to know, Bayes, democratization, and web scraping

A terrible “data skills” article that you should read, but only as a warning

I remember the hype that surrounded the web in the late 1990s. I also remember the copious amount of well-intentioned misinformation that made the rounds as writers attempted to capitalize on that hype. It’s now data science’s turn, if this bit of “advertorial” in Harvard Business Review — Prioritize Which Data Skills Your Company Needs with This 2×2 Matrix — is any indication.

Written by Chris Littlewood, chief innovation and product officer of filtered.com (I’m not going to help them by linking to their site), a company that purports to use AI to “lift productivity by making learning recommendations”, the article clearly highlights the author’s ignorance and HBR’s willingness to publish any article that has to do with data or data science. To the credit of the readers, a number of them registered with the site simply to be able to post comments pointing out how nonsensical the article was.

Treat this article as an object lesson in technology hype, as well as a sign that data science skills are seen as valuable.

The 5 Basic Statistics Concepts Data Scientists Need to Know

Forget that the article mentioned above said that mathematics and statistics aren’t useful data skills — you can’t do data science without them! You’ll need to understand these 5 concepts (in addition to others):

  1. Statistical features
  2. Probability distributions
  3. Dimensionality reduction
  4. Under- and oversampling
  5. Bayesian statistics

This article in Towards Data Science provides a brief overview.

Data Skeptic: Bayesian Updating

One of the better data science podcasts out there is Kyle Polich’s Data Skeptic, which has been around since 2014 and has over 400 episodes. The podcast features short mini-episodes explaining high level concepts in data science, and longer interview segments with researchers and practitioners.

I’ve just started working my way through this podcast, and have used the example in episode 5, Bayesian Updating, to explain Bayes’ Theorem to people who avoided studying probability and stats. Give it a listen, then check out the rest of the podcast episodes!
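
For the code-minded, here's a rough sketch of what a Bayesian update looks like. The scenario and numbers are invented for illustration and are not the example from the podcast episode: a condition with a 1% base rate, a test that catches it 90% of the time, and a 5% false-positive rate. The function name `bayes_update` is my own.

```python
def bayes_update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Return P(hypothesis | evidence) using Bayes' Theorem."""
    # P(evidence and hypothesis)
    numerator = p_evidence_if_true * prior
    # P(evidence) across both possibilities (hypothesis true or false)
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence


# Hypothetical numbers: 1% base rate, 90% true-positive rate, 5% false-positive rate.
belief = 0.01
for test_number in (1, 2):
    # Each positive result turns the previous posterior into the new prior.
    belief = bayes_update(belief, p_evidence_if_true=0.90, p_evidence_if_false=0.05)
    print(f"Belief after positive test {test_number}: {belief:.3f}")
```

The point of the loop is the "updating" part: yesterday's posterior becomes today's prior, so each new piece of evidence moves the estimate a little further.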

The Democratization of Data Science

Here’s a Harvard Business Review article on data science that’s actually worth reading:

Intelligent people find new uses for data science every day. Still, despite the explosion of interest in the data collected by just about every sector of American business — from financial companies and health care firms to management consultancies and the government — many organizations continue to relegate data-science knowledge to a small number of employees.

That’s a mistake — and in the long run, it’s unsustainable. Think of it this way: Very few companies expect only professional writers to know how to write. So why ask only professional data scientists to understand and analyze data, at least at a basic level?

Data Science Skills: Web scraping using python

Another article from Towards Data Science:

One of the first tasks that I was given in my job as a Data Scientist involved Web Scraping. This was a completely alien concept to me at the time, gathering data from websites using code, but is one of the most logical and easily accessible sources of data. After a few attempts, web scraping has become second nature to me and one of the many skills that I use almost daily.

In this tutorial I will go through a simple example of how to scrape a website to gather data on the top 100 companies in 2018 from Fast Track. Automating this process with a web scraper avoids manual data gathering, saves time and also allows you to have all the data on the companies in one structured file.
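
To give a flavour of what scraping a table involves, here's a minimal sketch that uses only Python's standard library. It is not the tutorial's code, and the tutorial may well use different libraries. The HTML snippet, company names, and column headings below are made up; they are not Fast Track's real markup or data, and the sketch parses an inline string rather than fetching a live page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Company</th><th>Location</th></tr>
  <tr><td>1</td><td>Example Widgets</td><td>London</td></tr>
  <tr><td>2</td><td>Sample Robotics</td><td>Leeds</td></tr>
</table>
"""


def scrape_table(html: str) -> list:
    """Turn the first row into column names and the rest into dicts."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


companies = scrape_table(SAMPLE_HTML)
for company in companies:
    print(company)
```

Once the rows are dictionaries, writing them out to one structured file (with `csv.DictWriter`, for example) is a short step.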


Who’s Got .NET Framework 3.5?

Alexander McCabe wanted to know the adoption rates of the various .NET runtimes, from .NET 1.0 up to the current .NET 3.5. He took the data from the logs for the website for his quiz-building software, Question Writer, augmented it by including figures published in Joel Spolsky’s Business of Software forum in March 2008, and turned it into the chart below:

Chart showing .NET Runtime Versions Used by Visitors to the Question Writer Site, March 2008 and May 2009 - October 2009

According to the chart, usage of .NET 3.5 among visitors to the Question Writer site has been growing in leaps and bounds since the spring, from just under 22% in May of this year to the current 52%.

Naturally, this data comes with all sorts of caveats:

  • The October 2009 data is based on the first 12 days of October.
  • Only Internet Explorer reliably reports .NET version information in the user-agent string.
  • McCabe offers a couple of contradictory explanations for how this might skew the numbers:
    • IE users may be more likely to have .NET installed because they use Microsoft software.
    • IE users may be less likely to have .NET installed because they may be less inclined to install additional software.
  • Question Writer uses the .NET runtime, so its site’s visitors may already have .NET installed.
  • There were a few users using .NET 4.0; McCabe counted them as .NET 3.5 users.
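
If you want to try this kind of tally on your own logs, here's a minimal sketch of the user-agent parsing step. It is my own illustration, not McCabe's actual method. It assumes the `.NET CLR x.y.z` tokens that Internet Explorer of this era puts in its user-agent string, and it ignores any other token formats. The sample user-agent strings and function names are hypothetical.

```python
import re
from collections import Counter

# Matches tokens such as ".NET CLR 3.5.30729" and captures major and minor version.
DOTNET_TOKEN = re.compile(r"\.NET CLR (\d+)\.(\d+)\.\d+")


def highest_dotnet_version(user_agent: str):
    """Return the highest (major, minor) .NET version reported, or None."""
    versions = [(int(major), int(minor)) for major, minor in DOTNET_TOKEN.findall(user_agent)]
    return max(versions) if versions else None


def tally_versions(user_agents) -> Counter:
    """Count visitors by the highest .NET version each user agent reports."""
    counts = Counter()
    for user_agent in user_agents:
        version = highest_dotnet_version(user_agent)
        label = f".NET {version[0]}.{version[1]}" if version else "not reported"
        counts[label] += 1
    return counts


# Hypothetical user-agent strings standing in for real log entries.
SAMPLE_USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; "
    ".NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.5.30729)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3",
]

for label, count in sorted(tally_versions(SAMPLE_USER_AGENTS).items()):
    print(label, count)
```

Counting only the highest version per visitor matters because a machine with .NET 3.5 typically also reports the older runtimes it has installed.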

I should try the same exercise using the logs for Global Nerdy, which has a rather mixed audience of open source, Mac and Microsoft types. I wonder how different the results would be.

This article also appears in Canadian Developer Connection.


Don’t Believe Your Web Stats

Don’t Believe Your Web Stats is a great article that covers how web visits are logged, how spiders and bots behave, how user behaviour can skew your stats and, ultimately, why you shouldn’t put too much faith in stats based on server logs.