Difference between revisions of "Data format structure variables"

From Personal Science Wiki

Revision as of 02:38, 29 May 2022

This page is mainly for DIYers with at least some knowledge of Excel or other spreadsheet software.

File Format

Comma Separated Values (CSV) is the easiest format for beginners to work with. Most analysis tools can also read JSON and SQLite without problems. Avoid proprietary formats, which some software may be unable to open. If you are using R, the readr package will report parsing problems it finds in a CSV file.
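For readers not using R, a similar check can be sketched in Python with only the standard library. This is a minimal sketch, not a replacement for readr: the column names (when, what, strength) and the choice of ISO 8601 timestamps are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def find_csv_problems(text, timestamp_column="when"):
    """Return (line_number, message) pairs for rows that look malformed."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    when_index = header.index(timestamp_column)
    for row in reader:
        line = reader.line_num
        # A row with too few or too many fields usually means a stray comma
        # or a missing value.
        if len(row) != len(header):
            problems.append((line, f"expected {len(header)} fields, got {len(row)}"))
            continue
        # Assumption: timestamps are stored in ISO 8601 form.
        try:
            datetime.fromisoformat(row[when_index])
        except ValueError:
            problems.append((line, f"bad timestamp: {row[when_index]!r}"))
    return problems


# Hypothetical sample data with two deliberate mistakes.
sample = (
    "when,what,strength\n"
    "2022-05-29T08:15:00,spasm,3\n"
    "2022-05-29 09:40,spasm\n"
    "yesterday,spasm,2\n"
)
problems = find_csv_problems(sample)
for line, message in problems:
    print(line, message)
```

Running such a check right after export tends to catch mistakes while you still remember what the row was supposed to say.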

Structure

Statistical analysis of self-tracking data is usually done on tabular data, such as spreadsheets, with each row representing an individual observation.[1][2] In all but a few cases this structure is sufficient.


Time tracking and events. For example, consider tracking muscle spasms of the left arm.

  • What. Necessary if the variable description does not imply only one value.
  • When. The time of the event is always included.
  • Duration. How long the event lasted.
  • Strength. How strong the event was. See Self assessment.
  • Notes. A written description of anything unusual about the event. This is like journaling and note-taking, but tied to a specific event, which makes it easier for an analysis tool to use.
  • Additional / advanced. For example, a true or false value for whether the spasms hurt. Some of these can be tests; others can be really complicated, as with Diet tracking apps. Sometimes this type of data cannot easily be represented in tabular form.
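The event fields listed above map naturally onto one row per event. The following Python sketch shows one possible layout; the field names, the 1 to 5 strength scale, and storing duration in seconds are all assumptions for illustration, not a prescribed schema.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass
class Event:
    what: str                          # e.g. which muscle or symptom
    when: datetime                     # always present
    duration_s: Optional[int] = None   # assumed unit: seconds
    strength: Optional[int] = None     # assumed 1-5 self-assessment scale
    notes: str = ""
    hurts: Optional[bool] = None       # example of an "additional" field


def events_to_csv(events):
    """Serialize events to CSV text, one row per event."""
    buffer = io.StringIO()
    names = [f.name for f in fields(Event)]
    writer = csv.DictWriter(buffer, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    for event in events:
        row = asdict(event)
        row["when"] = event.when.isoformat()
        writer.writerow(row)  # None is written as an empty cell
    return buffer.getvalue()


# Hypothetical example data.
events = [
    Event("left arm spasm", datetime(2022, 5, 29, 8, 15), 40, 3,
          "after lifting a box", True),
    Event("left arm spasm", datetime(2022, 5, 29, 21, 5)),
]
csv_text = events_to_csv(events)
print(csv_text)
```

Leaving unknown fields empty, rather than writing 0, keeps "not recorded" distinguishable from "recorded as zero" during later analysis.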


States are written once and apply until changed to something else. Examples are place of residence, or whether a brace is being worn continuously. This structure is similar to a simple "event" with just a when and a what, except that the duration is calculated from the time the next state replaces it.
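Calculating a state's duration from the next change can be sketched as follows in Python. The function name, the state labels, and the explicit end time for the last open state are assumptions for illustration.

```python
from datetime import datetime


def state_intervals(changes, end):
    """Turn (when, state) change records into (state, start, stop) intervals.

    Each state lasts until the next change; the last one lasts until `end`.
    """
    ordered = sorted(changes)
    intervals = []
    for i, (start, state) in enumerate(ordered):
        stop = ordered[i + 1][0] if i + 1 < len(ordered) else end
        intervals.append((state, start, stop))
    return intervals


# Hypothetical log of when a brace was put on or taken off.
changes = [
    (datetime(2022, 5, 1, 9, 0), "brace on"),
    (datetime(2022, 5, 4, 9, 0), "brace off"),
    (datetime(2022, 5, 10, 9, 0), "brace on"),
]
intervals = state_intervals(changes, end=datetime(2022, 5, 11, 9, 0))
for state, start, stop in intervals:
    print(state, (stop - start).days, "days")
```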


Continuous sensors, usually wearables, produce a time series at regular intervals. Each record consists of a when and a strength. Tools to survey symptoms produce something similar, though the entries are irregular and far less frequent.


Journal entries and notes. Journal entries are often free-form text describing the day.

Variables

Some common characteristics of data sources that you should record before analyzing them.

  • Independence. Is the variable affected by other variables you are measuring, or is it almost completely dependent on outside factors like the weather?
  • A variable that depends on previous values of the same variable is not independent. It is called autocorrelated, and it is often also non-stationary. For example, skill at playing the guitar.
  • Randomness of Missingness. Similar to independence, but the question is not about the value of the variable. It is about whether other measured variables correlate with a higher incidence of missing values. For example, forgetting to charge the smart band because of tiredness and then spending a night without it on.
  • Target. Level. Is this variable something you want to improve, a variable likely to affect those, or just an intermediary background variable measured because it was easy and provided context?
  • Similarity. Proxy. Is this variable measuring something very similar to what another variable is measuring? The most common example is heart rate, as many wearables measure it and the avid self-tracker always has a few.
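Two of the characteristics above can be checked with a few lines of Python. This is a rough sketch under stated assumptions: lag-1 autocorrelation is only one simple indicator of dependence on previous values, and the data (practice scores, resting heart rate, a "tired" flag) are made up for illustration.

```python
def lag1_autocorrelation(values):
    """Correlation of a series with itself shifted by one step."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return numer / denom


def missing_rate_by_group(values, groups):
    """Share of missing (None) values for each group label."""
    counts = {}
    for value, group in zip(values, groups):
        total, missing = counts.get(group, (0, 0))
        counts[group] = (total + 1, missing + (value is None))
    return {g: missing / total for g, (total, missing) in counts.items()}


# Hypothetical steadily improving skill score: strongly autocorrelated.
skill = [1, 2, 3, 4, 5, 6, 7, 8]
skill_r = lag1_autocorrelation(skill)

# Hypothetical nightly resting heart rate, None when the band was not worn,
# and whether the day before was logged as "tired".
night_hr = [52, None, 55, None, 54, 53, None, 51]
tired = [False, True, False, True, False, False, True, True]
rates = missing_rate_by_group(night_hr, tired)

print(skill_r)
print(rates)
```

If the missing rate differs a lot between groups, as in this made-up example, the gaps are not random and simply dropping those nights may bias the analysis.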

Data Cleaning

Soon after first use, check that the device or app produces correct data. Correct, remove or impute outliers (very extreme values) that were produced by errors rather than real events. In the rare case that the data is raw sensor output such as accelerometry, aggregate it into something more manageable. Consumer wearables produce "steps per 10 minutes", for which an open source script is likely available. Finally, compare against other data to remove errors such as exercise recorded in the middle of sleep. I have not seen a script for this yet. DG (talk)
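The first two cleaning steps can be sketched in Python as follows. This is a minimal illustration, not one of the open source scripts mentioned above: the plausible heart rate range of 30 to 220 and the input shapes are assumptions, and real raw accelerometry needs step detection before any counting.

```python
from collections import Counter
from datetime import datetime


def drop_implausible(readings, low=30, high=220):
    """Keep only (time, value) readings inside an assumed plausible range."""
    return [(t, v) for t, v in readings if low <= v <= high]


def steps_per_10_minutes(step_times):
    """Count already-detected step timestamps in 10-minute bins."""
    bins = Counter()
    for t in step_times:
        bin_start = t.replace(minute=t.minute - t.minute % 10,
                              second=0, microsecond=0)
        bins[bin_start] += 1
    return dict(sorted(bins.items()))


# Hypothetical heart rate readings with one sensor glitch (0) and one spike.
readings = [
    (datetime(2022, 5, 29, 8, 0), 62),
    (datetime(2022, 5, 29, 8, 1), 0),
    (datetime(2022, 5, 29, 8, 2), 64),
    (datetime(2022, 5, 29, 8, 3), 400),
]
clean = drop_implausible(readings)

# Hypothetical timestamps of individual detected steps.
step_times = [
    datetime(2022, 5, 29, 8, 1, 10),
    datetime(2022, 5, 29, 8, 9, 59),
    datetime(2022, 5, 29, 8, 10, 0),
    datetime(2022, 5, 29, 8, 25, 30),
]
binned = steps_per_10_minutes(step_times)
print(clean)
print(binned)
```

A fixed range only catches obvious glitches. Values that are plausible in isolation but wrong in context still need the comparison against other data described above.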

References

[1] https://r4ds.had.co.nz/tidy-data.html
[2] https://en.wikipedia.org/wiki/Relational_database
