Gain insights into game data science by learning data extraction, visualization, clustering, supervised learning, neural networks, and sequence analysis using R, impacting business decisions in social games.

Game data science is becoming a significant field of study, driven by the rise of social games embedded in online social networks. The ubiquity of social games gives access to new data sources and, with the introduction of freemium business models, affects essential business decisions. Game data science covers collecting, storing, and analyzing data, as well as communicating insights.

This course will teach you game data extraction, processing, data abstraction, data analysis through visualization, data clustering, supervised learning, neural networks, and advanced sequence analysis using a case study and many examples in the R programming language.

Game Data Science Using R

When we start working with behavioral telemetry data from games, we'll see that the raw data often contains on the order of 50 or more features (independent or dependent variables) that are measured continuously throughout a play session. On top of this, there will usually be many play sessions associated with each player. Often, each data point also depends on other data points. For example, quest completion relies on being in the right location, having the right items, etc.


Analyzing such high-dimensional data can be challenging, especially if we also need to take the temporal dimension into account, as in a time-series analysis. The many interactions and interdependencies between these variables also affect the statistical analysis we'll be performing.

# How is behavioral telemetry collected?
It's important to remember that behavioral telemetry is collected from players as they play. The more information we collect, the sparser the data will likely be across those dimensions, because not all players actually go through all the game spaces, especially in open-world games. This sparsity can make it hard to develop accurate models for specific problems. For example, it would be hard to analyze player movement patterns in an area that few players have visited.

For all these reasons, we have to be strategic about the information we’re collecting. Instead of attempting to deal with such high-dimensional data directly, a very common strategy is to develop an abstraction from the raw variables to a higher level with fewer variables, which reduces the dimensionality of the data while still providing useful information about player behavior. We refer to these variables as features or metrics. We use these two terms interchangeably to mean a variable of interest (or an independent variable, as we called them in the [previous](https://www.educative.io/courses/game-data-science-using-r/introduction-to-inferential-statistics#Dependent-and-independent-variables) chapter) abstracted from the raw data/measures.

# Usage of abstraction
Abstractions can have several purposes. For example, using abstraction methods, we can condense time but keep the sequential nature of the measures, aggregate over the temporal dimension (that is, removing time as a dimension), or develop new abstract variables that are functions of the variables in the raw data, thereby condensing the number of variables into a more manageable set. To take an example, the kill/death ratio is a common feature/metric developed for shooter games. Other good examples of abstractions over raw data points are provided in the list of game metrics discussed previously. Please review this list as it is extensive and can give you an idea of what we are trying to achieve in this chapter.
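As a minimal sketch of this kind of abstraction, the snippet below aggregates a small, invented event log over time and derives a kill/death ratio per player. The data frame `telemetry`, its column names, and the rule of treating zero deaths as one are all assumptions made for illustration; they are not taken from the course dataset.

```r
# Hypothetical raw telemetry: one row per event, per player, per session.
# All column names and values are invented for illustration.
telemetry <- data.frame(
  player_id  = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  session_id = c(1, 1, 2, 2, 1, 1, 1),
  event      = c("kill", "death", "kill", "kill", "death", "death", "kill"),
  stringsAsFactors = FALSE
)

# Aggregate over the temporal dimension: count kills and deaths per player.
kills  <- tapply(telemetry$event == "kill",  telemetry$player_id, sum)
deaths <- tapply(telemetry$event == "death", telemetry$player_id, sum)

# New abstract variable: the kill/death ratio.
# pmax() avoids division by zero for players with no deaths
# (an assumed convention; other conventions are possible).
features <- data.frame(
  player_id = names(kills),
  kills     = as.vector(kills),
  deaths    = as.vector(deaths),
  kd_ratio  = as.vector(kills) / pmax(as.vector(deaths), 1)
)

print(features)
```

Here, seven raw event rows collapse into one row per player, and time no longer appears as a dimension.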

In this chapter, we'll introduce the process of creating such features and metrics from raw data. There are many different strategies to accomplish this. These strategies can be summarized into three processes:

* **Feature engineering** refers to the process of using domain or expert knowledge to aggregate data and develop new features. Examples of this process are the metrics discussed in the [introductory chapter](https://www.educative.io/courses/game-data-science-using-r/what-are-metrics-in-game-data-science). Other examples can include averaging _kills per match per player_ and _time spent on each location_, where location is defined as an area in the game map.

* **Feature extraction** refers to the process of developing new features from raw measures using statistical techniques, reducing the number of variables by obtaining a set of principal variables. Feature extraction therefore derives new features $F_1, \cdots , F_m$, which are new variables obtained statistically from the raw variables $X_1 , \cdots , X_n$. A method that allows us to perform such extraction is Principal Component Analysis (PCA), which we’ll discuss in detail in this chapter. It should be noted that these types of techniques produce features that may not be interpretable by humans.

* **Feature selection** refers to the process of filtering the raw measures and selecting a few that are of interest, thus reducing the number of variables used for further analysis. This process is usually done through statistical methods that allow us to rank or score the importance of features given a prediction or outcome variable, such as whether the player won or not. As opposed to feature extraction, feature selection selects specific variables from the raw variables, $X_1, \cdots , X_n$, owing to their importance for modeling a particular relationship with a target variable $Y$. Therefore, the new variables are a subset of the raw variables, while the feature extraction technique develops new variables from the raw variables.
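The difference between extraction and selection can be seen in a small sketch. The example below uses simulated data and base R's `prcomp()`; the variable names `X1` to `X4`, the choice of two components, and the selected columns are assumptions made only to show the shape of each result.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical raw measures X1..X4 for 100 players (simulated, not real data).
n <- 100
X <- data.frame(
  X1 = rnorm(n),
  X2 = rnorm(n),
  X3 = rnorm(n),
  X4 = rnorm(n)
)
X$X2 <- X$X1 + rnorm(n, sd = 0.1)  # make X2 nearly redundant with X1

# Feature extraction: PCA builds new variables (PC1, PC2, ...) as
# linear combinations of all the raw variables.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
extracted <- pca$x[, 1:2]   # keep the first two components as F1 and F2
colnames(extracted)         # "PC1" "PC2"

# Feature selection: keep a subset of the original columns unchanged.
# The columns are picked arbitrarily here; a real method would rank them
# against a target variable Y.
selected <- X[, c("X1", "X3")]
colnames(selected)          # "X1" "X3"
```

The extracted columns are new variables that did not exist in `X`, whereas the selected columns are raw variables carried over as they are.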

# Chapter overview


In this chapter, we’ll discuss some techniques in detail. We’ll present some of the algorithms used and explain how such algorithms can be used through labs in R. We’ll focus on the latter two techniques, feature extraction and selection, rather than feature engineering. This is because feature engineering is often game-dependent and requires expert knowledge. Moreover, for feature engineering, we mostly use scripting to develop aggregate measures using functions similar to those we’ve discussed in the previous chapters. Therefore, we’ll keep it as an exercise for us to use Virtual Personality Assessment Lab (VPAL) data to engineer some features that may be useful for our analysis goal. This chapter includes the following labs:

* [PCA](https://www.educative.io/courses/game-data-science-using-r/introduction-to-feature-extraction#A-practical-example-of-applying-PCA) lab: Focuses on feature extraction with PCA.
* [PCA mix](https://www.educative.io/courses/game-data-science-using-r/how-to-deal-with-nominal-and-ordinal-measures#Run-PCAmix) lab: Extends the techniques used in the previous lab to include mixed data: qualitative and quantitative.
* [Feature selection](https://www.educative.io/courses/game-data-science-using-r/the-process-of-feature-selection#Forward-search-algorithm) lab: Focuses on feature selection showing forward and backward feature selection methods with example game data.
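To give a flavor of forward and backward selection before the lab, here is a minimal sketch using base R's `step()`, which adds or drops variables based on AIC. This is a stand-in chosen for brevity and may differ from the algorithm used in the lab; the data is simulated, and the variable names (`kills`, `deaths`, `play_time`, `noise`, `score`) are hypothetical.

```r
set.seed(1)  # reproducible simulated data

# Simulated player data (hypothetical variables, not the course dataset).
n <- 200
players <- data.frame(
  kills     = rnorm(n),
  deaths    = rnorm(n),
  play_time = rnorm(n),
  noise     = rnorm(n)
)
# By construction, the target depends only on kills and deaths.
players$score <- 2 * players$kills - 1.5 * players$deaths + rnorm(n, sd = 0.5)

full_scope <- score ~ kills + deaths + play_time + noise

# Forward search: start from the intercept-only model and add
# one variable at a time while AIC improves.
null_model <- lm(score ~ 1, data = players)
forward <- step(null_model, scope = full_scope,
                direction = "forward", trace = 0)

# Backward search: start from the full model and drop
# one variable at a time while AIC improves.
full_model <- lm(full_scope, data = players)
backward <- step(full_model, direction = "backward", trace = 0)

names(coef(forward))   # variables kept by the forward search
names(coef(backward))  # variables kept by the backward search
```

Both searches should retain `kills` and `deaths`, since the target was built from them. Whether an irrelevant variable also slips in depends on the random draw, which is one reason selection results are worth validating on separate data.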


It should be noted that some of these algorithms are based on machine learning techniques, which we'll introduce in more detail later in the course. For such cases, we will not delve deeply into the techniques but just introduce them and show how to use them, referring to the relevant chapters for more details. When such algorithms are discussed in later chapters, we recommend coming back to this chapter and considering how this added knowledge impacts our understanding of data abstraction. Before we delve into the subject of this chapter, we'll first discuss the dataset we'll be using throughout this chapter for examples and labs.




Learn the importance of behavioral telemetry in game data science.