Finding relations between variables in time series
Latest revision as of 18:16, 9 June 2024
Most personal science projects require finding relationships between different variables of the type 'time series'[1]. An example could be the question "does my daily chocolate consumption correlate with my daily focus score?".
You can run experiments if you control everything rigidly, or if the effects are strong and quick, appearing within, say, a week. Old data may be usable as a baseline, and a baseline may rule out some issues. If both a block design (e.g. two-week blocks) and a daily mixed design (a random intervention every day) produce the same results, then the time-series issues described below are probably not affecting your experiment.
Finding more complicated relationships requires better statistical tests and algorithms, and data science skills. Apps that would do this automatically, or at least easily, are not yet available; see below. Most internet resources treat time series as regular cyclical series, which is not useful, since most tracked variables have irregular patterns and often lack any regular cyclical component.
To do anything mentioned above you need to have your data parsed, cleaned, all in one place, and ideally even visualized.
List of less technical tools
There are also a number of tools and apps that can semi-automatically compute these correlations and help with the analysis.
Open Humans and their Personal Analysis notebooks
Open Humans provides a library of notebooks that can be used to visualize data across data sources and find relations between different variables. It also supports the upload of generic data files through the File Uploader.
Zenobase
Zenobase can test correlations based on user-specified questions. The user configures the lag, regression method, and aggregation method through a UI. It also offers powerful filtering tools.
Curedao
Curedao computes correlations over multiple bins and lags, selecting the combination with the biggest effect.
Data Flexor
DataFlexor offers lots of pretty pictures, but no very advanced statistics yet.
HALE.life
"Baysian nodes, 'do' semantics, AI and experts"[2] Right now only for sports teams.
Exist.io [3]
From the Exist.io main site :"Which habits go together? Correlations are the most powerful part of Exist. By combining your data, we can answer questions like: “What makes me happiest?”, “What can I do to be more active?”, “When am I most productive?”"
Habitdash
Habitdash's Automatic data analysis searches for hidden patterns to find relationships between activity, sleep, weight and other habits.
Optimized app
Optimized claims to do "automatic correlation mining"
Lytiko
lytiko.com promises correlations, connections, deep insights, and visualizations.
Vital
Vital (https://tryvital.io) is a free-to-use API for health and fitness data that collects wearables and health data and standardises them into one API. You can also use Vital's API for delivering at-home test kits.
young.ai and aging.ai
Deep learning predictors of age based on human blood tests; young.ai also makes recommendations.
Sonar (sonarhealth.co)
Customizable aggregation and syncing, e.g. weighting Fitbit data twice as much as Apple Watch data, or averaging steps instead of summing them.
tunum.health
Pearson correlation, trend analysis, and manual dichotomization.
Gyroscope

Wellness FX

Export from Apple Health[4] (no analysis)

ConnectorDB (DIY, open source; no analysis)

Heedy (DIY, open source; no analysis)

Zapier, Integromat, IFTTT (DIY integrations; no analysis)
List of very technical tools
Some people do all the data science themselves, using programming languages such as R and Python in notebooks or apps. Coding platforms such as the notebooks on Open Humans, Kaggle, RPubs, or GitHub can help, as can GUIs like Python's Orange.
Programming languages for statistics: MATLAB, R, Python, Julia.
Try a Python GUI for time series analysis.
DIY Individuals
Reasons time series analysis, especially as applied to QS, is hard
Wavelet coherence is one potential solution to some of the problems below.
Really strong relationships will still be detectable despite most of these problems.
Spurious Correlations mostly shows that if two things both trend in the same direction, checking them for correlation will yield a very "significant" correlation. The practice effect is a subset of this. Another case is when one instance of an event of type A increases the chances of the same event type happening again soon after. Economists suggest unit-root tests for these problems.
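A quick sketch of why trending series correlate, with invented data: two independent random walks typically show a sizable correlation in levels, while their day-to-day differences (the unit-root remedy economists apply) do not.

```python
# Two independent random walks: correlated in levels, not in differences.
import numpy as np

rng = np.random.default_rng(0)
a = np.cumsum(rng.normal(size=1000))  # random walk A
b = np.cumsum(rng.normal(size=1000))  # random walk B, fully independent of A

r_levels = np.corrcoef(a, b)[0, 1]                    # often large in magnitude
r_diffs = np.corrcoef(np.diff(a), np.diff(b))[0, 1]   # near zero

print(r_levels, r_diffs)
```

Differencing (looking at daily changes instead of running totals) is one simple way to avoid declaring two trending variables "related".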
Effects on the target variable from outside the known variables. In non-time-series settings this is compensated for with an RCT, but in a time series such an effect may last a while and coincide with an intervention, producing very misleading results. This problem makes baseline data gathering both more difficult and more necessary; sometimes a baseline will show that this issue does not occur for a particular target variable. Alternatively, the experimenter can compensate by strictly controlling all possible sources of variance.
Lag. What if eating pizza on one day causes heartburn the next?
Build up. What if it takes two days of eating pizza to cause heartburn?
Rate of change. Trend. The opposite of build-up: a derivative instead of an integral. Stopping or starting an all-pizza diet causes heartburn.
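The last three problems (lag, build-up, rate of change) map directly onto standard pandas transformations. A minimal sketch with a made-up daily pizza series:

```python
# Lag, build-up, and rate of change as one-line pandas transformations
# (the pizza data here is invented for illustration).
import pandas as pd

pizza = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0, 0],
                  index=pd.date_range("2024-01-01", periods=10))

lagged = pizza.shift(1)           # lag: yesterday's pizza vs today's heartburn
buildup = pizza.rolling(2).sum()  # build-up: pizza over the last two days
trend = pizza.diff()              # rate of change: starting/stopping the diet
```

Each transformed series can then be correlated against the outcome variable in place of the raw series.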
Bin. Window. Smooth. Some variables only make domain sense as an aggregate over some span of time, or have a very high sampling rate.
Interpolate. Variables have different sampling rates, so they need to be interpolated before they can be compared.
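Binning and interpolation are both resampling operations in pandas. A sketch, with invented series names, showing one high-rate series aggregated down and one sparse series interpolated up to a common daily grid:

```python
# Aligning two series with different sampling rates (invented data).
import pandas as pd

# A minute-level series, e.g. heart rate: aggregate into 15-minute bins.
hr = pd.Series(range(60),
               index=pd.date_range("2024-01-01", periods=60, freq="min"))
hr_15min = hr.resample("15min").mean()

# A sparse series, e.g. occasional weigh-ins: upsample to daily and
# fill the gaps by linear interpolation.
sparse = pd.Series([10.0, 13.0],
                   index=pd.to_datetime(["2024-01-01", "2024-01-04"]))
daily = sparse.resample("D").asfreq().interpolate("linear")
```

Once both series share an index, a direct correlation (or any of the lag/window transformations above) becomes meaningful.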
Types of data. [Exercised] is an event with a specific occurrence time and duration, while [tired] is a vaguer value a user might use to describe their feelings over the past 4 hours.
All the Issues with Self Report.
Few positive instances, but they are important. Example: went to a specific restaurant twice, got sick soon after both times, and only ever got sick with similar symptoms five times. Or: two large, rare humps happen almost one after the other; this is similar to the previous example if they are treated as events, with the added fact that many samples show their similarity in shape too.
Since removing the real effects of other variables on the target variable makes the effect of the variable of interest stand out, some form of 'machine learning' is needed. A basic approach is to bin the predictor variables in multiple ways, varying the lag from the effect being checked, the aggregation method (mean or otherwise), and the aggregation window.
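The brute-force version of this binning approach can be sketched in a few lines: try every combination of lag, window, and aggregator on a predictor and keep the one that correlates most strongly with the target. All data below is invented, with the target deliberately built from a 3-day rolling mean of the predictor, lagged by 2 days.

```python
# Grid search over lag, window, and aggregator (invented data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
predictor = pd.Series(rng.normal(size=200))
# Target is driven by a 3-day rolling mean of the predictor, lagged 2 days,
# plus a little noise.
target = predictor.rolling(3).mean().shift(2) + rng.normal(scale=0.1, size=200)

best = None
for lag in range(0, 5):
    for window in (1, 2, 3, 5):
        for agg in ("mean", "sum", "max"):
            feature = predictor.rolling(window).agg(agg).shift(lag)
            r = feature.corr(target)  # pairwise-complete Pearson r
            if best is None or abs(r) > abs(best[0]):
                best = (r, lag, window, agg)

print(best)  # strongest (r, lag, window, aggregator) combination found
```

With this setup the search recovers the planted lag of 2 and window of 3. Note that multiple testing is a real cost here: the more combinations you try, the more you must discount any single "significant" correlation.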
Machine learning also has limits on the kind of patterns it can detect.
What to expect from the complete analysis tool
A user without experience in statistical analysis will not be able to tell the difference between correctly computed correlations and poorly computed ones. However, a genuinely complete analysis produces plots, which should include at least some of the following:
Interpolation for irregular time series.
Change point or breakpoint detection.
Outlier detection. Smoothing.
Removal of the effects of variables found to correlate with this one, to show residuals.
Cycle decomposition using a model like ARIMA, e.g. kayak season is in the summer, or lunch is at exactly 1pm.
Detection of repeated shapes implying similar events that are not cyclical, e.g. dinner happens anywhere between 4pm and 10pm and causes a particular 2-hour spike in glucose. A possible solution to this[5], and more, may be found in the Matrix Profile.[6] Do read the '100 questions'.[7]
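As a toy illustration of cycle decomposition (much simpler than the ARIMA-class models mentioned above, and with invented data): estimate a weekly cycle by averaging each weekday position, subtract it, and inspect what remains.

```python
# Minimal cycle decomposition by period averaging (invented data; a
# sketch, not the method of any particular tool).
import numpy as np

rng = np.random.default_rng(2)
days = np.arange(28 * 4)                    # 16 weeks of daily data
weekly = np.where(days % 7 < 2, 5.0, 1.0)   # a "weekend" effect
series = weekly + rng.normal(scale=0.3, size=days.size)

period = 7
# Average each position within the week to estimate the cycle...
cycle = np.array([series[days % period == p].mean() for p in range(period)])
# ...then subtract it to get the residual.
residual = series - cycle[days % period]

print(cycle.round(1))   # roughly [5, 5, 1, 1, 1, 1, 1]
print(residual.std())   # noise left once the cycle is removed
```

Real tools would use a model such as ARIMA or STL, which also handles trend and autocorrelation, but the idea of separating a cyclical component from residuals is the same.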
References
- ↑ Core-Guide_Longitudinal-Data-Analysis_10-05-17.pdf (duke.edu)
- ↑ https://www.hale.life/
- ↑ https://github.com/ejain/n-of-1-ml
- ↑ github.com/Lybron/health-auto-export
- ↑ link.springer.com/article/10.1007/s10618-010-0179-5
- ↑ cran.r-project.org/web/packages/tsmp/vignettes/press.html
- ↑ www.cs.ucr.edu/~eamonn/100_Time_Series_Data_Mining_Questions__with_Answers.pdf