Data format structure variables

This page is mainly for DIYers with at least some knowledge of Excel or other statistical analysis software. The previous steps would be choosing a Tool or Aggregator; some tools give data straight to the user. This page covers the steps after receiving data. Soon after first use, check that the device or app produces correct data.

File Format

Comma Separated Values (CSV) is the easiest format for beginners to work with. Most analysis tools can also read JSON and SQLite without problems. Avoid proprietary formats, which some software may be unable to open. If you are using R, the readr package will automatically point out parsing errors in a CSV file.
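
For readers working in Python instead of R, a similar basic check can be sketched with the standard library. This is only a minimal illustration, not what readr does internally: the function name find_ragged_rows and the sample columns (when, what, strength) are made up for this example. It reports rows whose number of fields differs from the header, which is one of the most common CSV problems.

```python
import csv
import io


def find_ragged_rows(text):
    """Return the header and a list of (line_number, field_count) for rows
    whose field count differs from the header's field count."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    bad = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, len(row)))
    return header, bad


# Hypothetical sample: the last row is missing its "strength" value.
sample = (
    "when,what,strength\n"
    "2024-01-05T08:30,spasm,3\n"
    "2024-01-05T21:10,spasm\n"
)

header, bad = find_ragged_rows(sample)
print(header)
print(bad)
```

To check a real export, read the file contents into a string first, or adapt the function to accept an open file object.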

Structure

Statistical analysis of self-tracking data is usually done on tabular data, like spreadsheets, with rows representing individual observations.^[1]^[2] In all but a few cases this structure is sufficient. See also Dates and Times.

Time tracking and events. For example, consider tracking muscle spasms of the left arm.

What. Necessary if the variable description does not imply only one value.
When. The time of the event is always included.
Duration. How long the event lasted.
Strength. How strong the event was. See Self assessment.
Notes. A written description of anything unusual about the event. Like journaling and noting, but with a specific connection to this event, which makes it easier for an analysis tool to use.
Additional / advanced. For example, a true or false for whether the spasms hurt. Some of these can be tests; others can be really complicated, as with Diet tracking apps. Sometimes this type of data cannot easily be represented in tabular form.
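
The event fields listed above map naturally onto one row per event. The following is a minimal Python sketch under assumed names: the Event class, its field names (duration_s, painful) and the helper events_to_csv are hypothetical and not taken from any particular tool. It stores one spasm event and writes it as CSV with the time in ISO 8601 form.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import datetime
from typing import Optional


@dataclass
class Event:
    what: str                           # what happened
    when: datetime                      # start time of the event
    duration_s: Optional[float] = None  # how long it lasted, in seconds
    strength: Optional[int] = None      # self-assessed strength
    notes: str = ""                     # anything unusual about this event
    painful: Optional[bool] = None      # example of an "additional" field


def events_to_csv(events):
    """Write events as CSV text, one row per event."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(Event)],
        lineterminator="\n",
    )
    writer.writeheader()
    for e in events:
        row = asdict(e)
        row["when"] = e.when.isoformat()  # unambiguous timestamp text
        writer.writerow(row)
    return buf.getvalue()


events = [
    Event("left arm spasm", datetime(2024, 1, 5, 8, 30), 45.0, 3,
          "after lifting", True),
]
csv_text = events_to_csv(events)
print(csv_text)
```

Fields left as None are written as empty cells, which most analysis tools read as missing values.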

States are written once and apply until changed to something else. For example, place of residence or whether a brace is being worn continuously. This structure is similar to a simple "event" with just a 'when' and a 'what', though 'duration' is calculated from the time at which the state changes to something new.
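
Calculating the duration of each state from the change times can be sketched as follows. This is a minimal Python illustration; the function name state_durations, the end argument (the time at which the record stops) and the brace example values are assumptions made for this sketch.

```python
from datetime import datetime


def state_durations(changes, end):
    """changes: list of (when, what) state changes.
    end: time at which the record stops, used to close the last state.
    Returns a list of (what, start, duration) tuples."""
    ordered = sorted(changes, key=lambda c: c[0])
    result = []
    for i, (start, what) in enumerate(ordered):
        # A state lasts until the next change, or until the end of the record.
        stop = ordered[i + 1][0] if i + 1 < len(ordered) else end
        result.append((what, start, stop - start))
    return result


# Hypothetical log of a brace being put on and taken off.
changes = [
    (datetime(2024, 1, 1), "brace on"),
    (datetime(2024, 1, 15), "brace off"),
]
durations = state_durations(changes, end=datetime(2024, 2, 1))
for what, start, length in durations:
    print(what, start.date(), length.days, "days")
```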

Continuous sensors, usually wearables, produce a time series with regular intervals. Each row holds a when and a strength. Tools to survey symptoms produce something similar, though far less frequently and at irregular intervals.
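
Because a sensor series is expected to be regular, breaks in it are easy to find. The sketch below is a minimal Python illustration with an assumed 10 minute sampling interval and made-up timestamps; find_gaps is a hypothetical helper name. It lists every pair of neighbouring readings that are further apart than expected.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected):
    """Return (previous, current) pairs of neighbouring timestamps
    that are further apart than the expected interval."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > expected:
            gaps.append((prev, cur))
    return gaps


# Hypothetical readings every 10 minutes, with 30 minutes of data missing.
readings = [
    datetime(2024, 1, 5, 0, 0),
    datetime(2024, 1, 5, 0, 10),
    datetime(2024, 1, 5, 0, 20),
    datetime(2024, 1, 5, 1, 0),
]
gaps = find_gaps(readings, expected=timedelta(minutes=10))
print(gaps)
```

The same check is not meaningful for irregular data such as symptom surveys, where there is no expected interval to compare against.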

Journal entries and notes. Often journal entries are written texts describing the day.

Variables

Some common characteristics of data sources that you should record before analyzing.

Independence. Is the variable affected by other variables you are measuring, or is it almost completely dependent on outside factors like the weather?
A variable that depends on previous values of the same variable is not independent; it is called autocorrelated and may also be non-stationary. For example, skill at playing the guitar.
Randomness of Missingness. Similar to independence, but it concerns not the value of the variable but whether other measured variables correlate with a higher incidence of missing values. For example, forgetting to charge the smart band because of tiredness and spending a night without it on.
Target. Level. Is this variable something you want to improve, a variable likely to affect those, or just an intermediary background variable measured because it was easy and provided context? If this is a target variable, mention the purpose of tracking, such as Life extension, a doctor's recommendation based on Lab tests, or improving performance in Sports.
Similarity. Proxy. Is this variable measuring something very similar to what another variable is measuring? The most common example is heart rate, as many wearables measure it and the avid self-tracker always has a few.
Sign. Positivity. If the variable is a target, are higher values better or worse? Sometimes a middle value is best, as with BMI.
Scale and fact of Self assessment. Whether the variable is anchored to an objective standard, is subjective, or is even relative to a previous measurement. Also mention that it is a self assessment.
Is this target variable a measure of a problem, like pain; an accomplishment, like playing guitar better; or both, like a scale of cleverness in conversation?
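
One way to record these characteristics is a small "data dictionary" kept next to the data. The following Python sketch is only one possible layout: the class name VariableInfo, all field names, the allowed values in the comments and the two example variables are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VariableInfo:
    name: str
    role: str                 # "target", "driver" or "background"
    independent: bool         # unaffected by other tracked variables?
    autocorrelated: bool      # depends on its own previous values?
    missing_at_random: bool   # are gaps unrelated to other variables?
    better: str = "n/a"       # "higher", "lower" or "middle" (targets only)
    self_assessed: bool = False
    anchor: str = "objective" # "objective", "subjective" or "relative"
    proxies: List[str] = field(default_factory=list)  # similar variables
    purpose: str = ""         # why a target is being tracked


# Hypothetical entries for two tracked variables.
variables = [
    VariableInfo(
        "pain", role="target", independent=False, autocorrelated=True,
        missing_at_random=True, better="lower", self_assessed=True,
        anchor="subjective", purpose="doctor's advice",
    ),
    VariableInfo(
        "resting_heart_rate_band", role="background", independent=False,
        autocorrelated=True, missing_at_random=False,
        proxies=["resting_heart_rate_watch"],
    ),
]

targets = [v.name for v in variables if v.role == "target"]
print(targets)
```

Keeping this information in a structured form makes it possible to select, for instance, all target variables or all proxies of a given measurement before starting the analysis.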

Data Cleaning

Correct, remove or impute outliers (very extreme values) that were produced by errors rather than real events. In the rare case that the data is raw sensor output, like Accelerometry, aggregate it into something more manageable. Consumer wearables produce "steps per 10 minutes", for which an open source script is likely available. Finally, compare against other data to remove errors like exercising in the middle of sleep. I have not seen a script for this yet. DG (talk)
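
The last check, comparing against other data, could be approached along these lines. This is a rough Python sketch rather than an established script: it assumes that both exercise and sleep are available as (start, end) intervals, and the helper names overlaps and flag_conflicts as well as the example times are made up. It flags exercise records that overlap any sleep period so they can be reviewed by hand.

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end):
    """True if the two time intervals share any moment."""
    return a_start < b_end and b_start < a_end


def flag_conflicts(exercise, sleep):
    """Return the exercise intervals that overlap any sleep interval."""
    return [
        ex for ex in exercise
        if any(overlaps(ex[0], ex[1], s[0], s[1]) for s in sleep)
    ]


# Hypothetical records: one night of sleep and two exercise sessions.
sleep = [(datetime(2024, 1, 5, 23, 0), datetime(2024, 1, 6, 7, 0))]
exercise = [
    (datetime(2024, 1, 6, 3, 0), datetime(2024, 1, 6, 3, 20)),    # suspicious
    (datetime(2024, 1, 6, 18, 0), datetime(2024, 1, 6, 18, 45)),  # plausible
]

suspicious = flag_conflicts(exercise, sleep)
print(suspicious)
```

Which of the two conflicting records is wrong cannot be decided by the overlap alone, so flagged rows are better inspected than deleted automatically.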

References

[1] https://r4ds.had.co.nz/tidy-data.html

[2] https://en.wikipedia.org/wiki/Relational_database
