Vega-Lite for data exploration

Posted on April 19, 2020
4 min read

This is a follow-up to my previous post dedicated to Vega-Lite.

I want to practice Vega-Lite using the Sochi Olympic Games dataset, which contains information on all the athletes.

Making a scatterplot with Vega-Lite is, after a bit of practice, dead simple.

Let's refine it further to make it useful for data exploration.

Athletes' weight/height

I want to show how the athletes are distributed by height and weight, a perfect task for a scatterplot:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "width": 400,
  "data": {
    "url": "https://fabiofranchino.com/disk/datasets/athletes_sochi.csv",
    "format": {
      "type": "csv"
    }
  },
  "mark": "point",
  "encoding": {
    "x": {
      "type": "quantitative", "field": "weight"
    },
    "y": {
      "type": "quantitative", "field": "height"
    }
  }
}

Since some values are missing, I want to filter out the data points without a height or weight by adding a filter to the transform array:

"transform": [
  {
    "filter": "datum.weight > 0 && datum.height > 0"
  }
]

The same condition can also be split into a list of filters, which are applied in sequence and are much more readable:

"transform": [
  {
    "filter": "datum.weight > 0"
  },
  {
    "filter": "datum.height > 0"
  }
]

By adding one more filter we can show only a specific country:

{
  "filter": "datum.country === 'Italy'"
}
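
The same three filters can also be expressed with Vega-Lite's field predicates instead of expression strings. This is only a sketch of an alternative form, not what was used above; it assumes the CSV columns are named weight, height and country, as in the earlier snippets:

```json
"transform": [
  {"filter": {"field": "weight", "gt": 0}},
  {"filter": {"field": "height", "gt": 0}},
  {"filter": {"field": "country", "equal": "Italy"}}
]
```

Each entry is still a separate filter applied in order, so the result should match the expression-based version.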

But now I want to adjust the scales of both axes, because a lot of space is not used by the scatterplot. We can set a different domain for each encoding channel:

"x": {
  "type": "quantitative",
  "field": "weight",
  "scale": {
    "type": "linear",
    "domain": [45, 105]
  }
},
"y": {
  "type": "quantitative",
  "field": "height",
  "scale": {
    "type": "linear",
    "domain": [1.5, 2]
  }
}

So far so good.

I've tried to compute the domain dynamically, without success. Let's see whether it becomes feasible in a later attempt.
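
One option that might be worth trying, not verified against this dataset: quantitative x and y scales include zero by default when no custom domain is given, which is where most of the empty space comes from. Setting the scale's zero property to false should let Vega-Lite fit the domain to the filtered data instead of hard-coding it:

```json
"x": {
  "type": "quantitative",
  "field": "weight",
  "scale": {"zero": false}
},
"y": {
  "type": "quantitative",
  "field": "height",
  "scale": {"zero": false}
}
```

The resulting domain follows the data extent, so it changes whenever the filters change, for example when switching country.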

Gender

Now I want to compare the genders using a donut chart.

The first step is to transform the dataset so that it contains the values the encoding needs. An aggregate transform does that:

"transform": [{
  "aggregate": [{
    "op": "count",
    "field": "name",
    "as": "num" 
  }],
  "groupby": ["gender"]
}]

Then, let's add the mark that matches the graphic element we want to create:

"mark":{
  "type": "arc",
  "innerRadius": 40
}

And finally, the encoding that uses the new calculated property num:

"encoding": {
  "theta": {
    "type": "quantitative",
    "field": "num"
  },
  "color":{
    "type":"nominal",
    "field": "gender" 
  } 
}

Full source code here.
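
For reference, the three fragments above can be put together as follows. This is a sketch assembled from the snippets in this post, reusing the data source from the scatterplot; the linked source may differ in its details, and the description text is my own addition:

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch: donut chart of athlete counts per gender, assembled from the fragments in this post",
  "data": {
    "url": "https://fabiofranchino.com/disk/datasets/athletes_sochi.csv",
    "format": {"type": "csv"}
  },
  "transform": [{
    "aggregate": [{"op": "count", "field": "name", "as": "num"}],
    "groupby": ["gender"]
  }],
  "mark": {"type": "arc", "innerRadius": 40},
  "encoding": {
    "theta": {"type": "quantitative", "field": "num"},
    "color": {"type": "nominal", "field": "gender"}
  }
}
```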

There is still something I want to add: a label close to each slice. Since the label is an extra graphic element on top of the arcs, let's use the layer capability and add a second mark:

"layer": [
  {"mark": {"type": "arc", "innerRadius": 40}},
  {
    "mark": {"type": "text", "radius": 120},
    "encoding": {"text": {"field": "gender", "type": "nominal"}}
  }
]

And we need to move the encoding of the arc outside the layer:

"encoding": {
  "theta": {"type": "quantitative", "field": "num", "stack": true},
  "color": {"type": "nominal", "field": "gender"}
},
"layer": [
  {"mark": {"type": "arc", "innerRadius": 40}},
  {
    "mark": {"type": "text", "radius": 120, "color":"black"},
    "encoding": {"text": {"field": "gender", "type": "nominal"}}
  }
]

Source code.

A couple of notes here. I got it to work by following this example from the official website, where I learned two mandatory things:

  • Moving the encoding outside the layer
  • Using the stack property in the theta

I'm not sure I really understand the logic behind this; maybe it will become clear later.

I also couldn't change the text color, even using the valid color property in the mark definition. Again, I'm not sure whether it's a bug or something that needs to be done differently.
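
A possible explanation for both points, not verified here: the top-level encoding is shared by every layer. Arc marks stack theta by default while text marks do not, so stack: true is what gives the labels the same angular positions as the slices. The text layer also inherits the color encoding, and an encoding channel takes precedence over the color property in the mark definition. If that reading is right, overriding the color channel with a constant value inside the text layer should work. The spec below is an untested sketch assembled from the fragments in this post, with the description text added by me:

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch: labeled donut chart with black labels; the color override in the text layer is an assumption",
  "data": {
    "url": "https://fabiofranchino.com/disk/datasets/athletes_sochi.csv",
    "format": {"type": "csv"}
  },
  "transform": [{
    "aggregate": [{"op": "count", "field": "name", "as": "num"}],
    "groupby": ["gender"]
  }],
  "encoding": {
    "theta": {"type": "quantitative", "field": "num", "stack": true},
    "color": {"type": "nominal", "field": "gender"}
  },
  "layer": [
    {"mark": {"type": "arc", "innerRadius": 40}},
    {
      "mark": {"type": "text", "radius": 120},
      "encoding": {
        "text": {"field": "gender", "type": "nominal"},
        "color": {"value": "black"}
      }
    }
  ]
}
```

The arcs keep the shared gender color encoding, while the text layer replaces it with a fixed value.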

Again, so far, so good.