Difference between revisions of "What does clustering tell us"

Revision as of 12:48, 10 June 2022

Project Infobox
Self researcher(s)	User:Gedankenstuecke
Related tools	Oura Ring, Fitbit, Apple Watch, RescueTime
Related topics	Sleep, Activity, Weight tracking, Productivity tracking
Builds on project(s)	100 Days of Summer
Has inspired	Projects (0)

"What does clustering tell us" is a personal science project that is still ongoing and that is the result of discussions during the weekly self-research chats. The goal is to understand whether unsupervised clustering of a large number of metrics can lead to a better understanding of how different metrics relate to each other and if it shows any interesting clusters in how different days are similar to each other.

Background

The idea for this project came up in the spring of 2022 during a discussion of potential projects for the annual Keating Memorial, in which some participants in the self-research chats brainstormed whether and how one could learn more from a large set of data across a number of topics and metrics.

During this discussion a suggestion was to see whether unsupervised clustering could help uncover which variables correlate with each other while also highlighting whether there are different types of days. A search in the Show & Tell archives showed that a similar approach had already been tried in the 100 Days of Summer project.

Preparing data for clustering with Principal Component Analysis

To give it a first try, I decided to go ahead and use some of my data to see if such a clustering could work using a simple principal component analysis (PCA). In order to limit the scope I decided to use data from a variety of sources. To simplify the approach, I decided to use the day as the unit of observation. For this, I either summed up or averaged measurements throughout the day, depending on the metric (see Table below).
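The "sum or average per day" step can be sketched as follows. This is a minimal illustration, not the notebook used for the project: the sample values, the metric names `cycling_distance_km` and `weight_kg`, and the `AGGREGATORS` mapping are all hypothetical and only show the idea of choosing one aggregation rule per metric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw samples: (ISO timestamp, metric name, value).
samples = [
    ("2021-09-01T08:15:00", "cycling_distance_km", 3.2),
    ("2021-09-01T17:40:00", "cycling_distance_km", 3.0),
    ("2021-09-01T07:00:00", "weight_kg", 80.4),
    ("2021-09-01T21:00:00", "weight_kg", 80.8),
    ("2021-09-02T07:05:00", "weight_kg", 80.5),
]

# Assumed rule per metric: distances add up over a day, weight is averaged.
AGGREGATORS = {"cycling_distance_km": sum, "weight_kg": mean}


def aggregate_by_day(samples, aggregators):
    """Collapse timestamped samples into one value per (day, metric)."""
    grouped = defaultdict(list)
    for timestamp, metric, value in samples:
        day = timestamp[:10]  # the "YYYY-MM-DD" part of the ISO timestamp
        grouped[(day, metric)].append(value)
    return {key: aggregators[key[1]](values) for key, values in grouped.items()}


daily = aggregate_by_day(samples, AGGREGATORS)
for (day, metric), value in sorted(daily.items()):
    print(day, metric, round(value, 2))
```

Keeping the rule in a lookup table makes it explicit, for each metric, whether a day's value is a total or an average.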

Metrics used

In total, 37 different metrics are used for this work.

All variables (so far) used for this project
Data source	Metric type	Variable	Details
Oura Ring	Activity	Daily movement	activity measured in "walking distance equivalent", single value per day
		Steps	Number of steps taken, single value per day
		Total calories	single value per day
		Active calories	single value per day
		Average MET	Average Metabolic equivalent of task, single value per day
		inactive MET minutes	single value per day
		low activity MET minutes	single value per day
		medium activity MET minutes	single value per day
		high activity MET minutes	single value per day
		inactive time	minutes spent not being active, single value per day
		low activity time	minutes spent with low activity (low MET level)
		medium activity time	minutes spent with medium intensity activity
		high activity time	minutes spent with high intensity activity
	Sleep	Total time in bed	Includes time being awake, single value per day given in seconds
		Total sleep time	Total time asleep (excludes delay to falling asleep and awake time during night)
		Time awake	Time awake during the night
		REM sleep	seconds spent in REM sleep
		Deep sleep	seconds spent in deep sleep
		light sleep	seconds spent in light sleep
		Time restless	Time spent moving during sleep
		Sleep latency	Time difference between going to bed & falling asleep
	"Recovery" / Rest	Resting heart rate	lowest nightly heart rate
		Average sleep heart rate	average heart rate measured during sleep
		Heart Rate Variability	highest HRV measured during sleep
		Body temperature delta	Nightly body temperature compared to long-term baseline
Apple Watch	Activity	Cycling distance	Total distance cycled during a day (given in km)
Apple Watch	Activity	Walking + Running distance	Total distance walked / run during a day (given in km)
Fitbit	Body	Weight	Daily weight in kilograms (averaged if more than one measurement per day)
RescueTime	Productivity	Very distracting time	Total amount of time spent using apps classified as "very distracting"
		Distracting time	Total amount of time spent using apps classified as "distracting"
		Neutral time	Total amount of time spent using apps classified as "neutral"
		Productive time	Total amount of time spent using apps classified as "productive"
		Very productive time	Total amount of time spent using apps classified as "very productive"

The data was exported from the respective sources through the Open Humans integrations. A Jupyter notebook to export all this data in a unified spreadsheet will be made available soon.

Processing the data

I exported data for all these variables for the period between September 1, 2021 and June 8, 2022, as this was the period for which I expected most of the data to be complete. Following the export of the data as one large spreadsheet, some more processing was needed.

Doing a PCA ideally requires a "complete" data set without any missing values. Depending on the metric, the spreadsheet generated above still had gaps in it. Some gaps were due to lack of measurements (e.g. a gap in the weight record represents me not weighing myself), while in other cases a gap means that the value should be zero (e.g. if I did not cycle at all, then Apple Health would report a data gap, but it actually represents zero kilometers cycled). To fill the table, I performed a linear interpolation of my weight for missing days, and set missing values to zero for all RescueTime entries as well as for missing cycle distance values (those categories were the only ones affected).
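The two gap-filling rules described above can be sketched like this. The function names `interpolate_linear` and `fill_zero` and the five-day excerpt are hypothetical and chosen only for illustration. The sketch assumes one value per day in date order, with `None` marking a gap.

```python
def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours.

    Gaps before the first or after the last known value are left as None,
    because there is nothing to interpolate between.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def fill_zero(values):
    """Treat a missing value as 'nothing happened', i.e. zero."""
    return [0.0 if v is None else v for v in values]


# Hypothetical five-day excerpt; None marks a gap in the export.
weight_kg = [80.0, None, None, 81.5, 81.0]
cycling_km = [6.2, None, 12.0, None, None]

weight_filled = interpolate_linear(weight_kg)  # gaps become 80.5 and 81.0
cycling_filled = fill_zero(cycling_km)         # gaps become 0.0
```

Which rule fits depends on what a gap means for that metric: a missing weight is an unobserved value, while a missing cycling distance is a real zero.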

Running a PCA

The variable distribution after the PCA, answering how the 37 different variables correlate with each other. Arrows pointing in the same direction positively correlate with each other. Arrows pointing in opposite directions are negatively correlated. Length of the arrows is a metric for how 'well' the variable is represented in the PCA. Colors are the result of k-means clustering of variables.

With the full data table prepared for this time period, I ended up with 280 observations (i.e. days) that had complete data for these 37 variables and that I could use to run the PCA. For this I used the R package FactoMineR, as it provides not only the basic functions for running the analysis but also a wide set of visualization options. Roughly speaking, a PCA is a way to reduce the dimensionality of data by 'rotating' the data so that it can be represented in fewer dimensions, ideally no more than 2-3, as this allows visualizing it in a human-readable space. In this case, we have 37 dimensions, given by the 37 variables, and would like to boil them down to fewer dimensions while losing as little information as possible.
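To make the 'rotation' idea more concrete, here is a small, dependency-free sketch of a scaled PCA. It is not the FactoMineR workflow used for this project. It swaps in a plain power iteration with deflation on the correlation matrix, and the eight example days with four metrics are made-up numbers. The share of variance explained by a component is its eigenvalue divided by the number of (standardized) variables.

```python
import math
import random


def standardize(rows):
    """Centre every column and scale it to unit variance (z-scores)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1)) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of standardized data: Z^T Z / (n - 1)."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)] for i in range(p)]


def top_components(matrix, k, iterations=500):
    """Return the k largest (eigenvalue, eigenvector) pairs via power iteration."""
    p = len(matrix)
    m = [row[:] for row in matrix]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    pairs = []
    for _ in range(k):
        v = [rng.random() for _ in range(p)]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(m[i][j] * v[j] for j in range(p)) for i in range(p))
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found before looking for the next.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(p)] for i in range(p)]
    return pairs


# Hypothetical days: steps, active kcal, sleep hours, productive hours.
days = [
    [8000, 450, 7.5, 5.0],
    [12000, 700, 6.8, 2.0],
    [3000, 200, 8.2, 6.5],
    [10000, 600, 6.0, 4.0],
    [5000, 300, 7.9, 1.5],
    [9000, 520, 8.4, 3.0],
    [4000, 260, 6.5, 7.0],
    [11000, 640, 7.2, 5.5],
]

corr = correlation_matrix(standardize(days))
pairs = top_components(corr, k=2)
# The trace of a correlation matrix equals the number of variables,
# so eigenvalue / p is the share of total variance on that dimension.
explained = [eigenvalue / len(corr) for eigenvalue, _ in pairs]
print([round(share, 3) for share in explained])
```

In practice a dedicated library (such as FactoMineR in R) is the better choice. The point of the sketch is only to show that the axis percentages come from the eigenvalues of the correlation matrix.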

How do the different metrics correlate?

Running the PCA – including a normalization/re-scaling of the variables – results in the graph on the right. An additional clustering by k-means shows that there are three main groups into which the variables can be clustered:

The top left quadrant mainly includes all metrics associated with productivity as measured by RescueTime (regardless of productivity/unproductivity category), as well as several Oura metrics that relate to inactivity, and also my cycling distance.
The bottom right quadrant mainly includes sleep metrics from Oura, along with associated metrics such as resting heart rate and average sleeping heart rate, and also my weight.
The top right quadrant includes metrics whose arrows are at roughly 90° to both other clusters, which suggests they are largely uncorrelated with them within these two dimensions. These are mainly metrics related to medium and higher intensity activity, including my overall step count and active calorie burn.
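How a k-means step can group variables by where their arrows end can be sketched as follows. The coordinates are invented to mimic the three quadrants described above and are not the actual PCA output. Clustering on the two plotted dimensions only, and the deterministic farthest-point start, are both assumptions of this sketch rather than a description of how the original clustering was run.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    # Deterministic farthest-point initialisation (an assumption made for this
    # sketch so that the result is reproducible without random restarts).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Hypothetical arrow end points (dimension 1, dimension 2) per variable.
variables = {
    "productive_time": (-0.6, 0.5),
    "inactive_time": (-0.5, 0.6),
    "cycling_distance": (-0.7, 0.4),
    "total_sleep_time": (0.6, -0.5),
    "resting_heart_rate": (0.5, -0.6),
    "weight": (0.7, -0.4),
    "steps": (0.5, 0.6),
    "active_calories": (0.6, 0.5),
    "high_activity_time": (0.4, 0.7),
}

labels = kmeans(list(variables.values()), k=3)
for name, label in zip(variables, labels):
    print(label, name)
```

The cluster numbers themselves are arbitrary. What matters is which variables end up sharing a number.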

The axis labels also show how much of the overall variance in my data is explained by the two dimensions being plotted: 18.5% on the X-axis (dimension 1) and 13.8% on the Y-axis (dimension 2), or 32.3% taken together.

Linked content on this wiki


Projects that build on this project

We talked about this project in the following meetings
2022-06-09 Self-Research Chat

@@ Line 8: / Line 8: @@
 During this discussion a suggestion was to see whether unsupervised clustering could help uncover which variables correlate with each other while also highlighting whether there are different types of days. A search in the Show & Tell archives showed that a similar approach had already been tried in the [[100 Days of Summer]] project.
-== Reducing dimensions with a Principal Component Analysis ==
+== Preparing data for clustering with Principal Component Analysis ==
-To give it a first try, I decided to go ahead and use some of my data to see if such a clustering could work. In order to limit the scope I decided to use data from a variety of sources. To simplify the approach, I decided to use the '''''day''''' as the unit of observation. For this, I either summed up or averaged measurements throughout the day, depending on the metric (see Table below).
+To give it a first try, I decided to go ahead and use some of my data to see if such a clustering could work using a simple [[principal component analysis]] (PCA). In order to limit the scope I decided to use data from a variety of sources. To simplify the approach, I decided to use the '''''day''''' as the unit of observation. For this, I either summed up or averaged measurements throughout the day, depending on the metric (see Table below).
 === Metrics used ===
@@ Line 134: / Line 134: @@
 I exported data for all these variables for a time period between September 1, 2021 and June 08, 2022 as this was the period for which I felt like most data would be complete. Following the export of the data as one large spreadsheet, some more processing was needed.
-Doing a [[principal component analysis]] ideally requires a "complete" data set without any missing values. Depending on the metric, the spreadsheet generated above still had gaps in it. Some gaps were due to lack of measurements (e.g. a gap in the weight record represents me not weighing myself), while in other cases a gap means that the value should be zero (e.g. if I did not cycle at all, then Apple Health would report a data gap, but it actually represents zero kilometers cycled).
+Doing a PCA ideally requires a "complete" data set without any missing values. Depending on the metric, the spreadsheet generated above still had gaps in it. Some gaps were due to lack of measurements (e.g. a gap in the weight record represents me not weighing myself), while in other cases a gap means that the value should be zero (e.g. if I did not cycle at all, then Apple Health would report a data gap, but it actually represents zero kilometers cycled). To fill the table, I performed a linear interpolation of my weight for missing days, and set missing values to zero for all RescueTime entries as well as for missing cycle distance values (those categories were the only ones affected).
+== Running a PCA ==
+[[File:PCA test variable alignment.png|thumb|The variable distribution after the PCA, answering how the 37 different variables correlate with each other. Arrows pointing in the same direction positively correlate with each other. Arrows pointing in opposite directions are negatively correlated. Length of the arrows is a metric for how 'well' the variable is represented in the PCA. Colors are the result of kmeans clustering of variables.]]
+With the full data table prepared for this time period, I ended up with 280 observations (i.e. days) that had complete data for these 37 variables and that I could use to run the PCA. For this I used the [[R]] package <code>FactoMineR</code>, as it provides not only the basic functions for running the analysis but also a wide set of visualization options. Roughly speaking, a PCA is a way to reduce the dimensionality of data by 'rotating' the data so that it can be represented in fewer dimensions, ideally no more than 2-3, as this allows visualizing it in a human-readable space. In this case, we have 37 dimensions, given by the 37 variables, and would like to boil them down to fewer dimensions while losing as little information as possible.
+=== How do the different metrics correlate? ===
+Running the PCA – including a normalization/re-scaling of the variables – results in the graph on the right. Doing an additional clustering by kmeans shows that there are three main groups in which the variables can be clustered:
+# The '''top left quadrant''' mainly includes all metrics associated to productivity as measured by RescueTime (regardless of productivity/unproductivity category), as well as different metrics from Oura that relate to inactivity but also my cycling distance.
+# The '''bottom right quadrant''' includes mainly different sleep metrics from Oura but also associated metrics such as resting heart rate and average sleeping heart rate and furthermore my weight.
+# The '''top right quadrant''' includes metrics whose arrows are at roughly 90° to both other clusters, which suggests they are largely uncorrelated with them within these two dimensions. These are mainly metrics related to medium and higher intensity activity, including my overall step count and active calorie burn.
+The axis-labels also show how much of the overall variance in my data can be explained among these two dimensions that are being plotted, which comes down to 18.5% of variance on the X-axis (dimension 1) and 13.8% on the Y-axis (dimension 2).
 {{Project Queries}}
 [[Category:Projects]]