--- title: "Accessing the Wordbank database" author: "Mika Braginsky" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Accessing the Wordbank database} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include=FALSE} library(wordbankr) library(dplyr) library(ggplot2) knitr::opts_chunk$set(message = FALSE, warning = FALSE, cache = FALSE) theme_set(theme_minimal()) con <- connect_to_wordbank() can_connect <- !is.null(con) knitr::opts_chunk$set(eval = can_connect) ``` The `wordbankr` package allows you to access data in the [Wordbank database](http://wordbank.stanford.edu/) from `R`. This vignette shows some examples of how to use the data loading functions and what the resulting data look like. There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item. Additionally, you can get metadata about the datasets and instruments underlying the data. Advanced functionality let's you get estimates of words' age of acquisition and word mappings across languages. ## Administrations The `get_administration_data()` function gives by-administration information, either for a specific language and/or form or for all instruments. ```{r} get_administration_data(language = "English (American)", form = "WS") get_administration_data() ``` ## Items The `get_item_data()` function gives by-item information, either for a specific language and/or form or for all instruments. ```{r} get_item_data(language = "Italian", form = "WG") get_item_data() ``` ## Administrations x Items If you are only looking at total vocabulary size, `admins` is all you need, since it has both productive and receptive vocabulary sizes calculated. If you are looking at specific items or subsets of items, you need to load instrument data, using the `get_instrument_data()` function. Pass it an instrument language and form, along with a list of items you want to extract (by `item_id`). ```{r} get_instrument_data( language = "English (American)", form = "WS", items = c("item_26", "item_46") ) ``` By default `get_instrument_table()` returns a data frame with columns of the administration's `data_id`, the item's `num_item_id` (numerical `item_id`), and the corresponding value. To include administration information, you can set the `administrations` argument to `TRUE`, or pass the result of `get_administration_data()` as `administrations` (that way you can prevent the administration data from being loaded multiple times). Similarly, you can set the `iteminfo` argument to `TRUE`, or pass it result of `get_item_data()`. Loading the data is fast if you need only a handful of items, but the time scales about linearly with the number of items, and can get quite slow if you need many or all of them. So, it's a good idea to filter down to only the items you need before calling `get_instrument_data()`. As an example, let's say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want: ```{r, fig.width=6, fig.height=4} items <- get_item_data(language = "English (American)", form = "WS") if (!is.null(items)) { animals <- items %>% filter(category == "animals") } ``` Then we get the instrument data for those items: ```{r} if (!is.null(animals)) { animal_data <- get_instrument_data(language = "English (American)", form = "WS", items = animals$item_id, administration_info = TRUE, item_info = TRUE) } ``` Finally, we calculate how many animals words each child produces and the median number of animals of each age bin: ```{r, fig.width=6, fig.height=4} if (!is.null(animal_data)) { animal_summary <- animal_data %>% group_by(age, data_id) %>% summarise(num_animals = sum(produces, na.rm = TRUE)) %>% group_by(age) %>% summarise(median_num_animals = median(num_animals, na.rm = TRUE)) ggplot(animal_summary, aes(x = age, y = median_num_animals)) + geom_point() + labs(x = "Age (months)", y = "Median animal words producing") } ``` ## Metadata ### Instruments The `get_instruments()` function gives information on all the CDI instruments in Wordbank. ```{r} get_instruments() ``` ### Datasets The `get_datasets()` function gives information on all the datasets in Wordbank, either for a specific language and/or form or for all instruments. If the `admin_data` argument is set to `TRUE`, the results will also include the number of administrations in the database from that dataset. ```{r} get_datasets(form = "WG") get_datasets(language = "Spanish (Mexican)", admin_data = TRUE) ``` ## Advanced functionality: Age of acquisition The `fit_aoa()` function computes estimates of items' age of acquisition (AoA). It needs to be provided with a data frame returned by `get_instrument_data()` -- one row per administration x item combination, and minimally the columns `age` and `num_item_id`. It returns a data frame with one row per item and an `aoa` column with the estimate, preserving and item-level columns in the input data. The AoA is estimated by computing the proportion of administrations for which the child understands/produces (`measure`) each word, smoothing the proportion using `method`, and taking the age at which the smoothed value is greater than `proportion`. ```{r} fit_aoa(animal_data) fit_aoa(animal_data, method = "glmrob", proportion = 1/3) ``` ## Advanced functionality: Cross-linguistic data One of the item-level fields is `uni_lemma` ("universal lemma"), which is intended to be an approximate semantic mapping between words across the languages in Wordbank. The function `get_crossling_items()` simply gives all the available `uni_lemma` values. ```{r} get_crossling_items() ``` The function `get_crossling_data()` takes a vector of `uni_lemmas` and returns a data frame of summary statistics for each item mapped to that uni_lemma in any language (on `WG` forms). Each row is combination of item and age, and the columns indicate the number of children (`n_children`), means (`comprehension`, `production`), standard deviations (`comprehension_sd`, `production_sd`), and item-level fields. ```{r, eval=FALSE} get_crossling_data(uni_lemmas = c("hat", "nose")) %>% select(language, uni_lemma, item_definition, age, n_children, comprehension, production, comprehension_sd, production_sd) %>% arrange(uni_lemma) ```