Describing datasets
Below are two examples of datasets with different organizations and issues which demonstrate the capabilities of DatasetManager.
Well organized dataset with minimal issues
Consider a dataset organized as follows:
📁 genpath
├ 📁 Visual3D
│ ├ 📁 Subject 1
│ │ ├ 📁 export
│ │ │ └ 📁 park
│ │ │   ├ park-none.mat
│ │ │   ├ park-norm.mat
│ │ │   └ park-excess.mat
│ │ └ 📁 import
│ ├ 📁 Subject 2
│ ┆
│
└ 📁 DFlow
  ├ 📁 Subject 1
  │ ├ park-none.csv
  │ ├ park-norm.csv
  │ └ park-excess.csv
  │
  ├ 📁 Subject 2
  ┆
📁 rawpath
├ 📁 Subject 1
│ └ 📁 _
│   ├ park-none.c3d
│   ├ park-norm.c3d
│   └ park-excess.c3d
│
├ 📁 Subject 2
┆
The dataset is organized into 3 separate folders, but the trials use the same naming scheme in all of them. Therefore, we can group the data into 3 data subsets (genpath/Visual3D, genpath/DFlow, and rawpath) for this analysis, based on their location and file type. Each DataSubset gets a name, a source type, a parent directory, and a glob which describes the structure and location (and possibly more, e.g. the extension) of the files belonging to that DataSubset.
Julia code
genpath = "path/to/one/subset"
rawpath = "path/to/another/subset"
parksubsets = [
    DataSubset("visual3d", V3DExportSource, joinpath(genpath, "Visual3D"), "Subject [0-9]*/export/park/park-*.mat"),
    DataSubset("dflow", DFlowSource, joinpath(genpath, "DFlow"), "Subject [0-9]*/park-*.csv"),
    DataSubset("vicon", C3DSource, rawpath, "Subject [0-9]*/_/park-*.c3d")
]
MATLAB code
genpath = 'path/to/one/subset'
rawpath = 'path/to/another/subset'
parksubsets = [
    DataSubset('visual3d', 'V3DExportSource', fullfile(genpath, 'Visual3D/Subject */export/park/park-*.mat')),
    DataSubset('dflow', 'DFlowSource', fullfile(genpath, 'DFlow/Subject */park-*.csv')),
    DataSubset('vicon', 'C3DSource', fullfile(rawpath, 'Subject */_/park-*.c3d'))
]
The MATLAB globbing syntax only supports asterisk wildcards, which is why the MATLAB globs use "Subject *" where the Julia globs use "Subject [0-9]*".
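To make the effect of a glob such as "Subject [0-9]*/park-*.csv" concrete, the sketch below approximates it with a hand-written regular expression and applies it to a few candidate paths. This is only an illustration using Base Julia: the regex translation, the assumption that an asterisk does not cross a "/" separator, and the candidate paths are all hypothetical, and this is not how DatasetManager itself resolves globs.

```julia
# Hand-written regex approximating the glob "Subject [0-9]*/park-*.csv".
# Assumption: `*` matches any run of characters except the "/" separator.
pattern = r"^Subject [0-9][^/]*/park-[^/]*\.csv$"

# Hypothetical paths, relative to the subset's parent directory
candidates = [
    "Subject 1/park-none.csv",
    "Subject 2/park-norm.csv",
    "Subject 1/notes.txt",       # wrong name and extension
    "Subject A/park-none.csv",   # no digit after "Subject "
]

matched = filter(p -> occursin(pattern, p), candidates)
println(matched)
```

Only the first two candidates survive the filter; the other two fail on the file name and on the "[0-9]" class, respectively.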
This dataset has only one condition (a.k.a. "factor" in statistical contexts), with three levels. We wish to rename 2 of the levels from the terms that were used when the dataset was created. Any trial with "none" in the path will be recognized as a "held" trial; a trial that already uses the new terminology ("held") will also be recognized as a "held" trial. Likewise, "excess" is renamed to "active". The "norm" level is left unchanged, and will only match trials with "norm" in the path.
Julia code
levels = Dict(:arms => ["none" => "held", "norm", "excess" => "active"])
parkconds = TrialConditions((:arms,), levels)
MATLAB code
levels.arms(1).from = 'none'
levels.arms(1).to = 'held'
levels.arms(2).to = 'norm'
levels.arms(3).from = 'excess'
levels.arms(3).to = 'active'
parkconds = TrialConditions.generate({'arms'}, levels)
% alternately:
parkconds = TrialConditions.generate(fieldnames(levels), levels)
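The renaming rule described above can be sketched in a few lines of Base Julia. The helper canonical_level is hypothetical and is not part of DatasetManager; it only illustrates the idea that either the original term or the new term identifies a level.

```julia
# Hypothetical helper (not DatasetManager API): map a path to its renamed level.
levels = ["none" => "held", "norm" => "norm", "excess" => "active"]

function canonical_level(path, levels)
    for (old, new) in levels
        # Either the original term or the new term identifies the level
        if occursin(old, path) || occursin(new, path)
            return new
        end
    end
    return nothing  # no level found in this path
end

println(canonical_level("Subject 1/park-none.csv", levels))    # held
println(canonical_level("Subject 1/park-excess.csv", levels))  # active
println(canonical_level("Subject 1/park-held.csv", levels))    # held
```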
The findtrials function will search every DataSubset for trials which match the TrialConditions:
Julia code
# Find all trials that match the conditions
parktrials = findtrials(parksubsets, parkconds)
MATLAB code
parktrials = DataSet.findtrials(parksubsets, parkconds)
In some cases, there are duplicate files (e.g. a trial was redone due to technical difficulties) or unwanted files (e.g. corrupted data) that match the same set of conditions in a particular DataSubset, and the findtrials function will be unable to determine which file should be used for that DataSubset source. Suppose the first attempt for a trial, "Subject 01/_/park-norm.c3d", had an issue, and the trial was repeated with a '-02' added after the trial name ("Subject 01/_/park-norm-02.c3d").
julia> parktrials = findtrials(parksubsets, parkconds)
ERROR: DuplicateSourceError: Found "vicon" source file "…/Subject 01/_/park-norm-02.c3d" for
Trial(1, "park-norm", Dict{Symbol,Any}(:arms => "norm"), 3 sources) which already has
a "vicon" source at "…/Subject 01/_/park-norm.c3d"
Stacktrace:
[1] findtrials(::Array{DataSubset,1}, ::TrialConditions; I::Type{T} where T, subject_fmt::Regex, ignorefiles::Array{String,1}, defaultconds::Nothing) at /home/user/.julia/dev/DatasetManager/src/trial.jl:232
[2] top-level scope at REPL[7]:1
This DuplicateSourceError alerts you that, for Trial(1, "park-norm", Dict{Symbol,Any}(:arms => "norm")), there are conflicting files for the "vicon" source, and gives you the paths of the two files. The solution is to add any duplicate or unwanted files to the ignorefiles keyword argument (or the 'IgnoreFiles' optional argument in MATLAB).
Julia:
# Ignore the first (faulty) attempt so that only the repeated trial is used
parktrials = findtrials(parksubsets, parkconds; ignorefiles=[
    joinpath(rawpath, "Subject 01/_/park-norm.c3d")
])
MATLAB:
parktrials = DataSet.findtrials(parksubsets, parkconds, 'IgnoreFiles', { ...
    fullfile(rawpath, 'Subject 01/_/park-norm.c3d')
})
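As a rough illustration of why the duplicate arises and how ignoring a file resolves it, the following Base Julia sketch counts files per level before and after filtering. The helpers level and duplicated_levels and the file list are hypothetical; DatasetManager's own duplicate detection is not shown in this document and may work differently.

```julia
# Hypothetical file list for one subject's "vicon" source
files = [
    "Subject 01/_/park-none.c3d",
    "Subject 01/_/park-norm.c3d",
    "Subject 01/_/park-norm-02.c3d",
    "Subject 01/_/park-excess.c3d",
]
ignorefiles = ["Subject 01/_/park-norm.c3d"]

# Hypothetical helper: extract the level name from a path
level(path) = String(match(r"none|norm|excess", path).match)

# Hypothetical helper: list the levels that are claimed by more than one file
function duplicated_levels(files)
    counts = Dict{String,Int}()
    for f in files
        counts[level(f)] = get(counts, level(f), 0) + 1
    end
    return sort([lvl for (lvl, n) in counts if n > 1])
end

println(duplicated_levels(files))                        # the "norm" level is ambiguous
println(duplicated_levels(setdiff(files, ignorefiles)))  # no ambiguity remains
```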
Dataset with different naming schemes
Consider a different dataset, organized as follows:
📁 v3dpath
├ 📁 Subject 1
│ ├ 📁 Export
│ │ ├ 20181204_1400_NORMS_TR03.mat
│ │ ├ 20181204_1400_NORMC_TR03.mat
│ │ ├ 20181204_1400_NORM_PARK_TR03.mat
│ │ ┆
│ └ 📁 import
├ 📁 Subject 2
│ └ 📁 Export
│   ├ norm-singletask.mat
│   ├ Norm-dualtask.mat
│   ├ park-norm.mat
│   ┆
┆
📁 dflowpath
├ 📁 N01
│ ├ 20181204_1400_1448_AS_BA_NP_N01_TR01.txt
│ ├ 20181204_1400_1501_AS_CO_NP_N01_TR01.txt
│ ├ 20181204_1400_1646_NA_TR_NP_N05_TR01.txt
│ ┆
├ 📁 N02
┆
This analysis only needs 2 DataSubsets:
Julia code
v3dpath = "path/to/one/subset"
dflowpath = "path/to/another/subset"
parkdatafiles = [
DataSubset("visual3d", V3DExportSource, v3dpath, "Subject [0-9]*/Export/*.mat"),
DataSubset("dflow", RawDFlowPDSource, dflowpath, "N[0-9]*/*.txt")
]
MATLAB code
v3dpath = 'path/to/one/subset'
dflowpath = 'path/to/another/subset'
parkdatafiles = [
DataSubset('visual3d', 'V3DExportSource', fullfile(v3dpath, 'Subject */Export/*.mat')),
DataSubset('dflow', 'RawDFlowPDSource', fullfile(dflowpath, 'N*/*.txt'))
]
This dataset has several issues which make the level filters more complex and require the use of Regex to properly find the conditions.
- The "visual3d" subset isn't completely consistent in its naming. For example, "Norm" was sometimes used instead of "norm", and "dual" was sometimes used instead of "dualtask".
- The "dflow" subset used a completely different trial naming scheme: "AS" was used instead of "norm", "BA" instead of "singletask", etc.
Such conversions can be dealt with simply. However, a slightly more complex issue is that the "singletask" condition in the "visual3d" subset is denoted by an "S" following the "arms" factor. Just matching an "S" could match either the "S" in "Subject" or in "RS"; we need to only match an "S" that follows the "arms" factor, which can be specified by a positive lookbehind group in Regex, like this: "(?<=NONE|NORM)S". A similar Regex can be used to deal with the "C" for "dualtask".
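The lookbehind can be checked directly against the file names from this dataset using Base Julia regexes. This is a small sketch of the matching behavior only; it does not call DatasetManager.

```julia
# Match an "S" only when it directly follows an arms level
single = r"(?<=NONE|NORM)S"

single_file = "Subject 1/Export/20181204_1400_NORMS_TR03.mat"
dual_file   = "Subject 1/Export/20181204_1400_NORMC_TR03.mat"

m = match(single, single_file)
# The match is the "S" that ends "NORMS", not the "S" in "Subject"
println(single_file[m.offset-4:m.offset])  # NORMS

# A bare "S" would match too early, at the first character of "Subject"
println(match(r"S", single_file).offset)   # 1

# The dual-task file has no "S" after the arms level, so nothing matches
println(match(single, dual_file) === nothing)  # true
```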
A similar technique can be used to find the "TR" denoting the "park" condition, by using lookbehind and lookahead Regex groups. The naming scheme for the "dflow" subset contains a "TR" in every trial name (e.g. the trailing "TR01" in "20181204_1400_1646_AS_TR_NP_N05_TR01.txt"), unrelated to the "park" condition. However, we notice that the "TR" denoting the "park" condition has underscores on either side; based on that observation, we can write a Regex for these requirements as "(?<=_)TR(?=_)".
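A quick check with Base Julia shows that this pattern separates the two uses of "TR" in the "dflow" file names listed in the tree above. Again, this is only a sketch of the regex behavior, not of DatasetManager internals.

```julia
# "TR" with an underscore on both sides denotes the "park" condition
park = r"(?<=_)TR(?=_)"

park_file  = "N01/20181204_1400_1646_NA_TR_NP_N05_TR01.txt"
other_file = "N01/20181204_1400_1448_AS_BA_NP_N01_TR01.txt"

println(occursin(park, park_file))   # true: "_TR_" is present
println(occursin(park, other_file))  # false: "TR01" is followed by a digit

# Even in the park file, only the "_TR_" occurrence matches
println(length(collect(eachmatch(park, park_file))))  # 1
```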
Julia code
labels = Dict(
    :arms => [["NONE", "NA"] => "held", ["AS", "Norm", "NORM"] => "norm"],
    :kind => [["(?<=NONE|NORM|held|norm)S", "BA", "single"] => "singletask",
              ["(?<=NONE|NORM|norm|held)C", "CO", "CP", "dual"] => "dualtask",
              "PO" => "pert", ["PARK", "(?<=_)TR(?=_)"] => "park"],
    :pert_side => ["R(?=[ST]|slip|trip)" => "right", "L(?=[ST]|slip|trip)" => "left"],
    :pert_type => ["NP" => "steadystate", "(?<=[RL]|right|left)T" => "trip", "(?<=[RL]|right|left)S" => "slip"])
conds = TrialConditions((:arms,:kind,:pert_side,:pert_type), labels; required=(:arms,:kind))
MATLAB code
labels.arms(1).from = {'NONE', 'NA'};
labels.arms(1).to = 'held';
labels.arms(2).from = {'AS', 'Norm', 'NORM'};
labels.arms(2).to = 'norm';
labels.kind(1).from = {'(?<=NONE|NORM|held|norm)S', 'BA', 'single'};
labels.kind(1).to = 'singletask';
labels.kind(2).from = {'(?<=NONE|NORM|norm|held)C', 'CO', 'CP', 'dual'};
labels.kind(2).to = 'dualtask';
labels.kind(3).from = 'PO';
labels.kind(3).to = 'pert';
labels.kind(4).from = {'PARK', '(?<=_)TR(?=_)'};
labels.kind(4).to = 'park' ;
labels.pert_side(1).from = 'R(?=[ST]|slip|trip)';
labels.pert_side(1).to = 'right';
labels.pert_side(2).from = 'L(?=[ST]|slip|trip)';
labels.pert_side(2).to = 'left';
labels.pert_type(1).from = 'NP';
labels.pert_type(1).to = 'steadystate';
labels.pert_type(2).from = '(?<=[RL]|right|left)T';
labels.pert_type(2).to = 'trip';
labels.pert_type(3).from = '(?<=[RL]|right|left)S';
labels.pert_type(3).to = 'slip';
conds = TrialConditions.generate({'arms','kind','pert_side','pert_type'}, labels, 'Required', {'arms', 'kind'})
These TrialConditions also include the optional factors pert_side and pert_type. When the required ('Required' in MATLAB) keyword argument is not specified, all factors are assumed to be required. In this case, the "visual3d" subset only included the pert_side and pert_type levels for trials that included a perturbation, so only arms and kind are marked as required.
As always, the findtrials function will locate trials and sources within each subset which match the given conditions.
Julia code
# Read all perturbations
parktrials = findtrials(parkdatafiles, conds;
    subject_fmt=r"(?<=Subject |N)(?<subject>\d+)", ignorefiles=[
        joinpath(dflowpath, "N02/20181206_1500_1554_NA_BA_NP_N02_TR01.txt"),
        joinpath(dflowpath, "N02/20181206_1500_1657_AS_CP_RT_N02_TR01.txt"),
        ⋮
        joinpath(dflowpath, "N020/20190509_1000_1113_NA_CP_RS_N020_TR01.txt"),
        joinpath(dflowpath, "N020/20190509_1000_1153_AS_PO_LT_N020_TR01.txt")
    ], defaultconds=Dict(:pert_type => "steadystate"))
MATLAB code
% Read all perturbations
parktrials = DataSet.findtrials(parkdatafiles, conds, ...
    'SubjectFormat', '(?<=Subject |N)(?<subject>\d+)', 'IgnoreFiles', { ...
        fullfile(dflowpath, 'N02/20181206_1500_1554_NA_BA_NP_N02_TR01.txt'), ...
        fullfile(dflowpath, 'N02/20181206_1500_1657_AS_CP_RT_N02_TR01.txt'), ...
        ⋮
        fullfile(dflowpath, 'N020/20190509_1000_1113_NA_CP_RS_N020_TR01.txt'), ...
        fullfile(dflowpath, 'N020/20190509_1000_1153_AS_PO_LT_N020_TR01.txt') ...
    }, 'DefaultConditions', containers.Map('pert_type', 'steadystate'))
The subject_fmt Regex has been modified here to match the subject ID across the different naming schemes of the two DataSubsets.
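The sketch below applies this subject_fmt Regex to one file from each subset, using Base Julia only. Parsing the captured text to an integer (the hypothetical subject_id helper) is an assumption made here to show that "1" and "01" refer to the same subject; how DatasetManager itself compares subject IDs is not described in this document.

```julia
subject_fmt = r"(?<=Subject |N)(?<subject>\d+)"

v3d_file   = "Subject 1/Export/20181204_1400_NORMS_TR03.mat"
dflow_file = "N01/20181204_1400_1448_AS_BA_NP_N01_TR01.txt"

# The named group captures the digits after "Subject " or "N"
println(match(subject_fmt, v3d_file)[:subject])    # 1
println(match(subject_fmt, dflow_file)[:subject])  # 01

# Hypothetical helper: normalize the captured text to an integer ID
subject_id(path) = parse(Int, match(subject_fmt, path)[:subject])
println(subject_id(v3d_file) == subject_id(dflow_file))  # true
```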
The defaultconds Dict can be particularly useful when some conditions are optional, and therefore may not exist in the file path, but are needed or desired in the Trials.
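One way to picture the effect of default conditions is a Dict merge in which conditions found in the path take precedence over the defaults. The following is a sketch under that assumption, written in Base Julia with hypothetical condition Dicts; the actual precedence rules inside findtrials are not stated in this document.

```julia
# Assumption: defaults only fill in conditions that were not found in the path
defaultconds = Dict(:pert_type => "steadystate")

# Hypothetical conditions found for two trials
found_unperturbed = Dict(:arms => "norm", :kind => "singletask")
found_perturbed   = Dict(:arms => "held", :kind => "pert",
                         :pert_side => "right", :pert_type => "slip")

# `merge` lets later arguments win, so found conditions override the defaults
conds_unperturbed = merge(defaultconds, found_unperturbed)
conds_perturbed   = merge(defaultconds, found_perturbed)

println(conds_unperturbed[:pert_type])  # steadystate
println(conds_perturbed[:pert_type])    # slip
```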