Title: | Accessing the Peekbank Database and working with Peekbank data |
---|---|
Description: | Collection of tools for working with peekbank, an open repository for developmental eye-tracking data. |
Authors: | Mika Braginsky [aut, cre], Kyle MacDonald [aut], Michael Frank [aut] |
Maintainer: | Mika Braginsky <[email protected]> |
License: | GPL-3 |
Version: | 0.2.3.3 |
Built: | 2025-03-04 06:19:38 UTC |
Source: | https://github.com/langcog/peekbankr |
Adds a relative cdi score indicating the percentage of total achievable points the subject got on each given measure
append_relative_cdi_scores(subjects_table)
subjects_table |
a subjects table with unnested cdi data, needs columns "subject_id", "language", "instrument_type", "measure", "rawscore" |
the input table with an added "cdi_relative" column that contains the percentage of total points gained in the given administrations
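As a rough illustration of what a relative score means, the sketch below divides each raw score by a maximum achievable score and expresses the result as a percentage. The `max_points` column and its values are hypothetical; how the package actually looks up the maximum for each language, instrument and measure is not described here, nor is whether it stores the result on a 0-100 or 0-1 scale.

```r
# Toy table; max_points is an assumed stand-in for the per-instrument maximum.
cdi <- data.frame(
  subject_id = c(1, 2),
  measure    = c("prod", "comp"),
  rawscore   = c(34, 170),
  max_points = c(680, 680)  # hypothetical maximum achievable points
)

# Relative score: share of the achievable points, as a percentage.
cdi$cdi_relative <- 100 * cdi$rawscore / cdi$max_points
```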
cdi_data <- all_subjects %>%
  unnest(subject_aux_data) %>%
  filter(!is.na(cdi_responses)) %>%
  unnest(cdi_responses) %>%
  append_relative_cdi_scores()
Checks cdi data for inconsistencies, warns about them, and fixes them
cleanup_cdi_data(cdi_data)
cdi_data |
a subjects table with unnested cdi data, needs columns "subject_id", "language", "instrument_type", "age", "sex", "measure", "rawscore" |
a cleaned up version of the cdi data
clean_cdi_data <- all_subjects %>%
  unnest(subject_aux_data) %>%
  filter(!is.na(cdi_responses)) %>%
  unnest(cdi_responses) %>%
  peekbankr::cleanup_cdi_data()
Connect to Peekbank
connect_to_peekbank(db_version = "current", db_args = NULL, compress = TRUE)
db_version |
String of the name of database version to use |
db_args |
List with host, user, and password defined |
compress |
Flag to use compression protocol (defaults to TRUE) |
con A DBIConnection object for the peekbank database
con <- connect_to_peekbank(db_version = "current", db_args = NULL)
DBI::dbDisconnect(con)
Add AOIs to an xy dataframe
ds.add_aois(xy_joined)
xy_joined |
dataframe containing processed xy timepoints with aoi region sets information |
dataframe with two added columns, 'side' and 'aoi'. 'side' contains only the values "left" or "right"; 'aoi' indicates whether this xy timepoint is a look to the "target" or the "distractor".
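One way to picture the AOI assignment is a point-in-rectangle test against the left and right regions, followed by a comparison with the side the target was shown on. This is only a sketch: the column names (`l_x_min`, `r_x_max`, `target_side`, and so on), the handling of points outside both regions, and the "other" label are assumptions made for the example, not the package's documented behaviour.

```r
# Toy xy timepoints already joined with hypothetical region bounds.
xy <- data.frame(
  x = c(100, 900, 500),
  y = c(300, 300, 300),
  l_x_min = 0,   l_x_max = 400,  l_y_min = 100, l_y_max = 500,
  r_x_min = 600, r_x_max = 1000, r_y_min = 100, r_y_max = 500,
  target_side = c("left", "left", "right")
)

# Point-in-rectangle tests for each region.
in_left  <- xy$x >= xy$l_x_min & xy$x <= xy$l_x_max &
            xy$y >= xy$l_y_min & xy$y <= xy$l_y_max
in_right <- xy$x >= xy$r_x_min & xy$x <= xy$r_x_max &
            xy$y >= xy$r_y_min & xy$y <= xy$r_y_max

# Side of the screen the gaze falls on (NA when outside both regions).
xy$side <- ifelse(in_left, "left", ifelse(in_right, "right", NA_character_))

# Compare with the target side; "other" is an assumed label for misses.
xy$aoi <- ifelse(is.na(xy$side), "other",
                 ifelse(xy$side == xy$target_side, "target", "distractor"))
```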
Fetching the list of field names and requirements in each table according to the schema json file
ds.get_json_fields(table_type)
table_type |
the type of dataframe; for the most up-to-date table types specified by the schema, please use the function ds.list_ds_tables() |
the list of field names
## Not run:
fields_json <- ds.get_json_fields(table_type = "aoi_timepoints")
## End(Not run)
parse json file from peekbank github into a dataframe
ds.get_peekjson()
the organized dataframe from schema json file
## Not run:
peekjson <- ds.get_peekjson()
## End(Not run)
Download peekbank processed dataset from OSF
ds.get_processed_data(lab_dataset_id, path = ".", osf_address = "pr6wu")
lab_dataset_id |
Specific ID occurring in the file hierarchy of the relevant OSF repo. |
path |
Where you want it on your own machine. Will error if directory doesn't exist. |
osf_address |
pr6wu for peekbank. |
Download specific peekbank dataset from OSF
ds.get_raw_data(lab_dataset_id, path = ".", osf_address = "pr6wu")
lab_dataset_id |
Specific ID occurring in the file hierarchy of the relevant OSF repo. |
path |
Where you want it on your own machine. Will error if directory doesn't exist. |
osf_address |
pr6wu for peekbank. |
Check if a certain table is required according to schema
ds.is_table_required(table_type, coding_methods)
table_type |
the type of dataframe; for the most up-to-date table types specified by the schema, please use the function ds.list_ds_tables() |
coding_methods |
methods used in the experiment for coding gaze data, to get the list of current coding methods, please use function ds.list_coding_methods() |
A boolean value
## Not run:
is_required <- ds.is_table_required(table_type = "xy_timepoints", coding_methods = "manual gaze coding")
## End(Not run)
Get the coding method list from json schema file
ds.list_coding_methods()
a list of strings indicating allowed coding methods
## Not run:
coding_methods <- ds.list_coding_methods()
## End(Not run)
List the tables required based on coding method
ds.list_ds_tables(coding_methods = c("eyetracking"))
coding_methods |
a list of strings indicating the methods used in the experiment for coding gaze data; to get the list of current coding methods, please use the function ds.list_coding_methods() |
a list of table types that are required based on input coding method
## Not run:
table_list <- ds.list_ds_tables(coding_methods = "manual gaze coding")
## End(Not run)
List current allowed language choices for db import
ds.list_language_choices()
a list of strings containing all the allowed language codes based on json schema file
## Not run:
language_list <- ds.list_language_choices()
## End(Not run)
Function for mapping raw data columns to processed table columns
ds.map_columns(raw_data, raw_format, table_type)
raw_data |
raw data frame |
raw_format |
source of the eye-tracking data, e.g. "tobii" |
table_type |
type of processed table, e.g. "xy_data" or "aoi_table" |
processed data frame with specified column names
## Not run:
df_xy_data <- ds.map_columns(raw_data = raw_data, raw_format = "tobii", table_type = "xy_data")
df_aoi_data <- ds.map_columns(raw_data = raw_data, raw_format = "tobii", table_type = "aoi_data")
## End(Not run)
sets the starting point of a given trial to be zero
ds.normalize_times(df_table)
df_table |
to-be-resampled dataframe with t, aoi/xy values, trial_id and administration_id |
df_out with resampled time, xy or aoi value rows
Put processed data for specific peekbank dataset on OSF
ds.put_processed_data(token, dataset_name, path = ".", osf_address = "pr6wu")
token |
personal access token for uploading to OSF |
dataset_name |
Specific dataset name occurring in the file hierarchy of the relevant OSF repo. |
path |
Where the data live on your own machine. |
osf_address |
pr6wu for peekbank. |
Resampling is done by the following steps:
ds.resample_times(df_table, table_type)
df_table |
to-be-resampled dataframe with t, aoi/xy values, trial_id and administration_id |
table_type |
table name, can only be "aoi_timepoints" or "xy_timepoints" |
1. iterate through every trial for every administration
2. create desired timepoint sequence with equal spacing according to pre-specified SAMPLE_RATE parameter
3. use approxfun to interpolate the given data points so that they align with the desired timepoint sequence. The "constant" interpolation method is used for AOI timepoints; the "linear" interpolation method is used for xy timepoints. For more details on approxfun, please see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/approxfun.html
4. after resampling, bind resampled dataframes back together and re-assign aoi_timepoint_id
df_out with resampled time, xy or aoi value rows
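The interpolation step can be sketched for a single trial with `stats::approxfun`. The sample rate of 40 Hz, the toy timestamps, and the trick of interpolating integer codes for the categorical AOI labels are assumptions made for this illustration; the package's actual SAMPLE_RATE value and internals are not given in this section.

```r
sample_rate <- 40                  # assumed rate in Hz
step_ms <- 1000 / sample_rate      # 25 ms between resampled timepoints

# Irregularly spaced raw samples for one trial (toy values).
t_raw   <- c(0, 30, 70, 100)
aoi_raw <- c("target", "target", "distractor", "distractor")
x_raw   <- c(10, 40, 80, 110)

# Desired, equally spaced timepoint sequence.
t_new <- seq(0, 100, by = step_ms)

# "constant" interpolation for AOI labels: carry the last observed label
# forward. approxfun works on numbers, so labels are mapped to integer codes.
aoi_levels <- unique(aoi_raw)
aoi_fun <- stats::approxfun(t_raw, match(aoi_raw, aoi_levels),
                            method = "constant", rule = 2)
aoi_new <- aoi_levels[aoi_fun(t_new)]

# "linear" interpolation for continuous xy coordinates.
x_fun <- stats::approxfun(t_raw, x_raw, method = "linear", rule = 2)
x_new <- x_fun(t_new)

resampled <- data.frame(t = t_new, aoi = aoi_new, x = x_new)
```

Constant interpolation keeps AOI values categorical (a look cannot be halfway between "target" and "distractor"), whereas linear interpolation is reasonable for continuous gaze coordinates.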
## Not run:
dir_datasets <- "testdataset" # local datasets dir
lab_dataset_id <- "pomper_saffran_2016"
dir_csv <- file.path(dir_datasets, lab_dataset_id, "processed_data")
table_type <- "aoi_timepoints"
file_csv <- file.path(dir_csv, paste0(table_type, '.csv'))
df_table <- utils::read.csv(file_csv)
df_resampled <- ds.resample_times(df_table, table_type = "aoi_timepoints")
## End(Not run)
sets the starting point of a given trial to be zero
ds.rezero_times(df_table)
df_table |
to-be-resampled dataframe with t, aoi/xy values, trial_id and administration_id |
df_out with resampled time, xy or aoi value rows
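A minimal sketch of re-zeroing, assuming it amounts to subtracting each trial's earliest timestamp within each administration. The column `t_zeroed` is a name chosen for this example, and the package may implement the step differently.

```r
# Two trials from one administration, with timestamps in ms (toy values).
df <- data.frame(
  administration_id = c(1, 1, 1, 1),
  trial_id          = c(1, 1, 2, 2),
  t                 = c(500, 525, 1200, 1225)
)

# ave() returns the per-group minimum aligned with the original rows,
# so subtracting it makes every trial start at zero.
df$t_zeroed <- df$t - ave(df$t, df$administration_id, df$trial_id, FUN = min)
```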
check all csv files against database schema for database import
ds.validate_for_db_import(
  dir_csv,
  cdi_expected,
  file_ext = ".csv",
  is_null_field_required = TRUE
)
dir_csv |
the folder directory containing all the csv files, the path should end in "processed_data" |
cdi_expected |
specifies whether cdi_data is expected to be present in the imported data |
file_ext |
the default is ".csv" |
an empty string if all tables passed the validator; otherwise, the function returns a list of messages describing the issues that need to be fixed
## Not run:
msg_error_all <- ds.validate_for_db_import(dir_csv = "./processed_data")
## End(Not run)
Check if a dataframe/table is compliant with the peekbank json schema before database import
ds.validate_table(
  df_table,
  table_type,
  cdi_expected,
  dir_csv,
  is_null_field_required = TRUE
)
df_table |
the dataframe to be saved |
table_type |
the type of dataframe; for the most up-to-date table types specified by the schema, please use the function ds.list_ds_tables() |
is_null_field_required |
by default is set to TRUE which means that all the columns in the json file are required; when user specifically sets this to FALSE, then the fields that are allowed null values are not required. |
an empty string when the input data frame is compliant with the json specification (for example, it has all the required columns and the primary key field has unique values). Otherwise, the function returns a list of messages describing the issues that need to be fixed
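The two checks named above (required columns present, primary key unique) can be sketched in base R. `check_table` and its arguments are hypothetical helpers written for this illustration; the real validator reads its requirements from the schema json file and covers more cases.

```r
# Hypothetical helper: returns "" when the table passes, otherwise messages.
check_table <- function(df, required_cols, primary_key) {
  msgs <- character(0)

  # Check 1: every required column must be present.
  missing_cols <- setdiff(required_cols, names(df))
  if (length(missing_cols) > 0) {
    msgs <- c(msgs, paste("missing columns:",
                          paste(missing_cols, collapse = ", ")))
  }

  # Check 2: the primary key column must not contain duplicates.
  if (primary_key %in% names(df) && anyDuplicated(df[[primary_key]]) > 0) {
    msgs <- c(msgs, paste("primary key", primary_key, "is not unique"))
  }

  if (length(msgs) == 0) "" else msgs
}

good <- data.frame(trial_id = 1:3, trial_order = 1:3)
bad  <- data.frame(trial_id = c(1, 1))

msg_good <- check_table(good, c("trial_id", "trial_order"), "trial_id")
msg_bad  <- check_table(bad,  c("trial_id", "trial_order"), "trial_id")
```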
## Not run:
is_valid <- ds.validate_table(df_table = df_table, table_type = "xy_data", cdi_expected = FALSE, dir_csv = "../processed_data")
## End(Not run)
Check if within aoi_timepoints table, there is no duplication in all the administration_ids associated with each individual trial_id
ds.validate_trial_uniqueness_constraint(df_aoi_timepoints)
df_aoi_timepoints |
the aoi_timepoints dataframe |
cdi_expected |
specifies whether cdi_data is expected to be present in the imported data; only relevant for the subjects table. A special table type could be created so that invalid combinations of table_type and cdi_expected cannot occur, but since the current behaviour does not break anything, this is low priority |
an empty string when all the administration_ids are unique within each trial_id; otherwise, an error message is returned.
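Read as "each trial_id should be associated with exactly one administration_id", the constraint can be sketched in base R as follows. This is an interpretation of the description above, using toy data, and not the package's own implementation.

```r
aoi_timepoints <- data.frame(
  trial_id          = c(1, 1, 2, 2, 3),
  administration_id = c(10, 10, 10, 11, 11)
)

# Count the distinct administration_ids seen for each trial_id.
n_admins <- tapply(aoi_timepoints$administration_id,
                   aoi_timepoints$trial_id,
                   function(a) length(unique(a)))

# Trials linked to more than one administration violate the constraint.
bad_trials <- names(n_admins)[n_admins > 1]
```

Here trial 2 appears under administrations 10 and 11, so it is the only trial flagged.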
## Not run:
is_valid <- ds.validate_trial_uniqueness_constraint(df_aoi_timepoints = df_aoi_timepoints)
## End(Not run)
Get administrations
get_administrations(
  age = NULL,
  dataset_id = NULL,
  dataset_name = NULL,
  connection = NULL
)
age |
A numeric vector of a single age or a min age and max age (inclusive), in months |
dataset_id |
An integer vector of one or more dataset ids |
dataset_name |
A character vector of one or more dataset names |
connection |
A connection to the peekbank database |
A 'tbl' of Administrations data, filtered down by supplied arguments. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
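The `age` argument accepts either a single age or an inclusive min/max pair. The sketch below mimics that behaviour on a local data frame; `filter_by_age` is a hypothetical helper written for this example and is not part of the package.

```r
# Hypothetical helper illustrating the single-age / age-range convention.
filter_by_age <- function(df, age = NULL) {
  if (is.null(age)) return(df)             # no filter requested
  if (length(age) == 1) age <- c(age, age) # single age: min == max
  df[df$age >= age[1] & df$age <= age[2], , drop = FALSE]
}

admins <- data.frame(administration_id = 1:4, age = c(12, 18, 24, 30))

in_range   <- filter_by_age(admins, age = c(18, 24))  # inclusive range
single_age <- filter_by_age(admins, age = 12)         # exactly 12 months
unfiltered <- filter_by_age(admins, age = c())        # c() is NULL: no filter
```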
## Not run:
get_administrations()
get_administrations(age = c())
get_administrations(dataset_name = "pomper_saffran_2016")
## End(Not run)
Get AOI region sets
get_aoi_region_sets(connection = NULL)
connection |
A connection to the peekbank database |
A 'tbl' of AOI Region Sets data, filtered down by supplied arguments. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
## Not run:
get_aoi_region_sets()
## End(Not run)
Get AOI timepoints
get_aoi_timepoints(
  dataset_id = NULL,
  dataset_name = NULL,
  age = NULL,
  rle = TRUE,
  connection = NULL
)
dataset_id |
An integer vector of one or more dataset ids |
dataset_name |
A character vector of one or more dataset names |
age |
A numeric vector of a single age or a min age and max age (inclusive), in months |
rle |
Logical indicating whether to use RLE data representation or not |
connection |
A connection to the peekbank database |
A 'tbl' of AOI Timepoints data, filtered down by supplied arguments. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
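For readers unfamiliar with the RLE (run-length encoding) representation mentioned for the `rle` argument, the idea can be shown with base R's `rle()` and `inverse.rle()`. This only illustrates the concept on a toy vector; how the database stores and decodes its RLE table is not described in this section.

```r
# A short sequence of AOI labels at consecutive timepoints (toy values).
aoi <- c("target", "target", "target", "distractor", "distractor", "target")

# Run-length encoding stores each run once, together with its length.
runs <- rle(aoi)
run_values  <- runs$values
run_lengths <- runs$lengths

# Decoding expands the runs back into one value per timepoint.
restored <- inverse.rle(runs)
```

Because looks tend to stay on one AOI for many consecutive samples, the encoded form is usually much smaller than one row per timepoint.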
## Not run:
get_aoi_timepoints(dataset_name = "pomper_saffran_2016")
## End(Not run)
Get datasets
get_datasets(connection = NULL)
connection |
A connection to the peekbank database |
A 'tbl' of Datasets data. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
## Not run:
get_datasets()
## End(Not run)
Get information on database connection options
get_db_info()
List of database info: host name, current version, supported versions, historical versions, username, password
get_db_info()
Run a SQL Query script on the Peekbank database
get_sql_query(sql_query_string, connection = NULL)
sql_query_string |
A character string containing a valid SQL query |
connection |
A connection to the Peekbank database |
The database after calling the supplied SQL query
## Not run:
get_sql_query("SELECT * FROM datasets")
## End(Not run)
Get stimuli
get_stimuli(dataset_id = NULL, dataset_name = NULL, connection = NULL)
dataset_id |
An integer vector of one or more dataset ids |
dataset_name |
A character vector of one or more dataset names |
connection |
A connection to the peekbank database |
A 'tbl' of Stimuli data, filtered down by supplied arguments. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
## Not run:
get_stimuli()
get_stimuli(dataset_name = "pomper_saffran_2016")
## End(Not run)
Get subjects
get_subjects(connection = NULL)
connection |
A connection to the peekbank database |
A 'tbl' of Subjects data. Note that Subjects is a table used to link longitudinal Administrations, which is the primary table you probably want. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
## Not run:
get_subjects()
## End(Not run)
Get trial types
get_trial_types(dataset_id = NULL, dataset_name = NULL, connection = NULL)
dataset_id |
An integer vector of one or more dataset ids |
dataset_name |
A character vector of one or more dataset names |
connection |
A connection to the peekbank database |
A 'tbl' of Trial Types data, filtered down by supplied arguments. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
## Not run:
get_trial_types()
get_trial_types(dataset_name = "pomper_saffran_2016")
## End(Not run)
Get trials
get_trials(dataset_id = NULL, dataset_name = NULL, connection = NULL)
dataset_id |
An integer vector of one or more dataset ids |
dataset_name |
A character vector of one or more dataset names |
connection |
A connection to the peekbank database |
A 'tbl' of Trials data, filtered down by supplied arguments. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
## Not run:
get_trials()
get_trials(dataset_name = "pomper_saffran_2016")
## End(Not run)
Get XY timepoints
get_xy_timepoints(
  dataset_id = NULL,
  dataset_name = NULL,
  age = NULL,
  connection = NULL
)
dataset_id |
An integer vector of one or more dataset ids |
dataset_name |
A character vector of one or more dataset names |
age |
A numeric vector of a single age or a min age and max age (inclusive), in months |
connection |
A connection to the peekbank database |
A 'tbl' of XY timepoints data, filtered down by supplied arguments. If 'connection' is supplied, the result remains a remote query, otherwise it is retrieved into a local tibble.
## Not run:
get_xy_timepoints(dataset_name = "pomper_saffran_2016")
## End(Not run)
List of peekbank tables
list_peekbank_tables(connection)
connection |
A connection to the peekbank database |
A vector of the names of tables in peekbank
## Not run:
con <- connect_to_peekbank()
list_peekbank_tables(con)
## End(Not run)
Populate the provided cdi data with percentile values for that specific age, instrument_type, measure and language. Loosely based on the work from this repo https://github.com/kachergis/cdi-percentiles/tree/main by George Kachergis and Jess Mankewitz with advice from Virginia Marchman.
populate_cdi_percentiles(subjects_table)
subjects_table |
a subjects table with unnested cdi data, needs columns "subject_id", "language", "instrument_type", "age", "sex", "measure", "rawscore" |
the input table with added columns containing the reference age used, the reference year used, and both gender specific and general percentile values for the cdi score
full_cdi_data <- all_subjects %>%
  unnest(subject_aux_data) %>%
  filter(!is.na(cdi_responses)) %>%
  unnest(cdi_responses) %>%
  peekbankr::cleanup_cdi_data() %>%
  peekbankr::populate_cdi_percentiles()
Unpacks the JSON string in the *_aux_data column and turns it into a nested R list
unpack_aux_data(df)
df |
a dataframe in the peekbank format that has an aux data column |
the input dataframe, with the *_aux_data column unpacked
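To show what "unpacking" a JSON string into a nested list looks like, the sketch below parses one hand-written aux data string with the jsonlite package. The use of jsonlite and the exact shape of the JSON are assumptions for illustration; the field names are taken from the column names mentioned elsewhere in this manual, and the package may parse its aux data differently.

```r
library(jsonlite)

# A hand-written example of what one subject's aux data string might contain.
aux_json <- '{"cdi_responses": [{"instrument_type": "wg", "measure": "prod", "rawscore": 34}]}'

# simplifyVector = FALSE keeps the result as a nested list rather than
# collapsing it into vectors or data frames.
aux <- jsonlite::fromJSON(aux_json, simplifyVector = FALSE)

first_response <- aux$cdi_responses[[1]]
first_score    <- first_response$rawscore
```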
## Not run:
subjects_table <- unpack_aux_data(df = subjects_table)
## End(Not run)