---
title: "Recoding nominal data"
author: "Michael Cysouw"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette: null
  md_document:
    variant: markdown_github
  html_document:
    df_print: paged
  pdf_document:
    number_sections: yes
vignette: >
  %\VignetteIndexEntry{Recoding nominal data}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include = FALSE}
library(qlcData)
```

# Introduction

A common situation in comparative linguistic data collection is that data is encoded as nominal ('categorical') attributes. An attribute is conceived as a finite list of possibilities, viz. the values of the attribute. Although this is of course a completely normal and widespread practice in data encoding, in comparative linguistics it is mostly not trivial to decide on the delimitation of the different values. There is often ample discussion about the best way to separate the attested variation into types (terms like 'types', 'categories' or 'values' will be considered as equivalent here), and in general there is often no optimal or preferred way to define the different types.

In practice then, different scholars will often like to interpret data differently. One common wish is to be able to recode data that has already been categorised into types by another scholar. Note that such a recoding of course will never be able to easily split types, because for that goal a complete reconsideration of all underlying raw data is necessary, something that will not be further considered here.

Given any nominal data set (like for example the WALS data, as included in this package), there are various transformations that are often requested, and that are reasonably easy to perform: *merge values* and *split attributes*. A third kind of recoding, *merge attributes*, is also possible, but needs a bit more effort.
Furthermore, the actual recoding will consist of just a few lists of information, which allows for an easy way to share and publish the actual recoding decisions in the form of a *recoding profile*. To produce such profiles, a *recoding template* is proposed that can be used to quickly document any recoding decisions.

# Merge values

As an example, consider the following toy data frame with two attributes `size` and `kind`.

```{r nominalData}
data <- data.frame(
  size = c('large','very small','small','large','very small','tiny'),
  kind = c('young male','young female','old female','old male',NA,'young female'),
  row.names = paste('obs', 1:6, sep='')
)
data
```

The first kind of recoding to be exemplified here is to merge values. The first attribute in our data has four values: `large`, `small`, `very small` and `tiny`. Suppose we would like to merge the values `small`, `very small` and `tiny` into one value `small`. What we have to do is to define a new attribute with new values, and link the original values to our new values.

In practice such a recoding looks as shown below: a list with a new name for the new attribute (`attribute=`), names for the new values of this attribute (`values=`), the name of the original attribute that is to be recoded (`recodingOf=`), and the central information for the recoding: the link-vector (`link=`). The link-vector has the same length as the number of values of the original attribute, in the order as given by `levels(data$size)`, viz. in this case `r levels(data$size)`.[^1] These four original values are linked to the new values as specified in the link-vector: the first original value is linked to the first new value (1), the second original value is linked to the second new value (2), the third original value to the second new value (2), etc. A zero in this link-vector is designated for values that should not be linked (i.e. NA, but that does not work in YAML, see below).

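The mechanics of a numerical link-vector can be illustrated in base R, independently of this package. The following is only a sketch: it assumes that `size` is stored as a factor, the names `new_index` and `new_size` are made up for the illustration, and it does not claim to mirror the internals of `recode`.

```r
# Toy attribute from the example above, stored explicitly as a factor
size <- factor(c('large', 'very small', 'small', 'large', 'very small', 'tiny'))

# Order of the original values; the sorting may depend on the locale
levels(size)

# New values and a numerical link: one entry per original level
values <- c('large', 'small')
link <- c(1, 2, 2, 2)

# Illustration only (hypothetical helper variables, not qlcData code):
# map the level index of each observation through the link-vector
new_index <- link[as.integer(size)]
new_index[which(new_index == 0)] <- NA   # a zero means "do not link"
new_size <- factor(values[new_index], levels = values)

# new_size now holds: large, small, small, large, small, small
new_size
```

Because the link is positional, the result is only right as long as `levels(size)` comes out in the order that was assumed when the link-vector was written.
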
[^1]: The order of the levels is crucial when using a numerical link, and this order might be different depending on your locale! A more stable way is not to use a numerical link, but a named link, see below.

The function `recode` takes the original data and the recoding-specification, and returns the new, recoded, data:

```{r merge}
# Specifying a recoding
recoding <- list(
  list(
    recodingOf = 'size',
    attribute = 'newSize',
    values = c('large','small'),
    link = c(1,2,2,2)
  )
)
# Do the actual recoding and see the results
recode(recoding, data)
```

# Split attributes

The recoding-object has a doubly-embedded list structure, which might seem superfluous, but this is because the example above only specifies a single new attribute. To specify more than one attribute, several such specifications can simply be listed, as illustrated below. In the following example, the second original attribute (`kind`) is split into two new attributes (`gender` and `age`), but such a split is simply represented as two different ways of merging the values. In total, our recoding example has now been extended to recoding three different new attributes.

```{r mergesplit}
# Specifying the recoding of three different new attributes
recoding <- list(
  list(
    recodingOf = 'size',
    attribute = 'size',
    values = c('large','small'),
    link = c(1,2,2,2)
  ),
  list(
    recodingOf = 'kind',
    attribute = 'gender',
    values = c('female','male'),
    link = c(1,2,1,2)
  ),
  list(
    recodingOf = 'kind',
    attribute = 'age',
    values = c('old','young'),
    link = c(1,1,2,2)
  )
)
# Do the recoding and show it
newdata <- recode(recoding, data)
newdata
```

# Merge attributes

Combining various attributes into a single new attribute works very similarly, except that multiple attributes are specified at `recodingOf`. Note that there are various zeros in the link-vector, specifying that some value-combinations are not to be linked.

```{r mergeattributes}
back_recoding <- list(
  list(
    recodingOf = c('size','age'),
    attribute = 'size+age',
    values = c('largeOld','largeYoung','smallOld','smallYoung'),
    link = c(1,3,0,2,4,0,0,0,0,0)
  )
)
recode(back_recoding, newdata)
```

To get the indices (including zeros) of the link-vector right, one has to realize that the recoding of two attributes is based on the cross-product of the values from the two attributes, including possible NAs. Internally, this uses the function `expand.grid`, leading in the current example to the combinations of values listed below.

For larger mergers (with multiple attributes, or with attributes that have many values) this can become rather tedious, because there are very many possible combinations that all have to be linked in the link-vector. For such situations it is possible to use the option `all.options = FALSE` in the recoding template (see below), which results in only the attested combinations being listed. This of course is not foolproof if the dataset is expanded after the recoding profile is written. However, for data that does not change, this option can be highly useful.

```{r expandgrid}
expand.grid(c(levels(newdata$size),NA),c(levels(newdata$age),NA))
```

# Using recoding templates

Specifying recodings is often rather tedious within R. Also, the resulting nested list data structure in R is not very insightful to share or publish. As an alternative, I propose to use a YAML representation of the recoding for editing and sharing. The function `write.recoding` can be used to produce a template that can then be manually edited. All the necessary information for the recoding will be included in the file.

The list of the attributes that one wants to recode should be specified as a **list** in the function `write.recoding`. In that way it is possible to recode both individual attributes and combinations of attributes.
For example, `write.recoding(data = data, attributes = list(1, c(1,2)), file = file)` will write the following YAML information to `file`. The tildes `~` show the missing information to be added. Note that the second recoding is a combination of two attributes. The function `recode` accepts a path to such a YAML-file as an input of a recoding.

```{r yamloutput, echo = FALSE}
cat(yaml::as.yaml(write.recoding(data, attributes=list(1,c(1,2)))))
```

# Named linking

As can be seen in the recoding template, it is best practice to use named linking. This works as follows:

- **The new values are now "key: value" pairs**, in which the key can be whatever you like (the template suggests 'a' and 'b', but this can be changed, and of course more keys can be added).
- **The links are also "key: value" pairs**, in which the keys are the original values, and the values are to be added by the user: they should be the keys from the new values, i.e. 'a' or 'b' or whatever you have chosen for the new values.
- **Any original values not named in 'link' are ignored**, i.e. they will get NA values after recoding. So, if you leave the tilde, or if you delete a line from the link-list, then these original values will be turned into NA after recoding.
- **Combinations of attributes use the fixed delimiter ' + ', i.e. 'space-plus-space'**. You can easily build new combinations by combining original levels with the combining-delimiter 'space-plus-space'. This delimiter was chosen with the assumption that it will normally not occur in the data, and it is easily typeable and readable.

Named linking is also extremely useful for merging of attributes. The example from above is repeated here using named linking (note the sequence 'space-plus-space' in the key-names of the link.
Also note the complete absence of any combination that should not be recoded, and that the order is unimportant):

```{r mergeattributes_v2}
back_recoding <- list(
  list(
    recodingOf = c('size','age'),
    attribute = 'size+age',
    values = c(
        lo = 'largeOld'
      , ly = 'largeYoung'
      , so = 'smallOld'
      , sy = 'smallYoung'
      ),
    link = c(
        'small + young' = 'sy'
      , 'small + old' = 'so'
      , 'large + young' = "ly"
      , 'large + old' = 'lo'
      )
  )
)
recode(back_recoding, newdata)
```

Also note that this recoding looks much better and is easier to produce in YAML:

```
recoding:
- recodingOf:
  - size
  - age
  attribute: size+age
  values:
  - lo: largeOld
  - ly: largeYoung
  - so: smallOld
  - sy: smallYoung
  link:
  - 'small + young': sy
  - 'small + old': so
  - 'large + young': ly
  - 'large + old': lo
```

# Using recoding shortcuts

It is of course also possible to just manually write a recoding structure, either directly as a list within R or as a YAML-file. To make this even easier, the function `read.recoding` (used internally in `recode` as well) allows for various shortcuts in the formulation of a recoding:

- **Order is unimportant:** Because every recoding is a labelled list, the specifications can be entered in any order. The order will be harmonized by `read.recoding`.
- **Abbreviate labels:** The labels of the specifications (like `attribute` or `link`) can be abbreviated, and in practice the first letter suffices.
- **Leave out names:** For a recoding, only `link` and `recodingOf` are minimally necessary. New attribute and value names are added automatically when nothing is specified. The automatically specified names are not very useful though (they look like `Att1` or `Val4`). Manual specification of names is strongly preferred.
- **Keep original attribute by not specifying `link`:** When no `link` is specified, the original attribute from `recodingOf` will simply be copied verbatim, without any recoding.
- **Use column numbers instead of attribute names:** Instead of the names of the original attributes it is also possible to specify the number of the column in the original data frame.
- **Use `doNotRecode` to keep original attributes:** To add original attributes without recoding them it is also possible to use `doNotRecode=` (possibly abbreviated as `d=`), followed by a vector with the column numbers of the original data to be copied.

To illustrate these possibilities, consider the following recoding of our toy dataset:

```{r shortcuts}
short_recoding <- list(
  # same as first example at the start of this vignette
  # using abbreviations and a different order
  list(
    r = 'size',
    a = 'newSize',
    l = c(1,2,2,2),
    v = c('large','small')
  ),
  # the same, using named linking
  list(
    r = 'size',
    a = 'newSize',
    v = list(a = 'large', b = 'small'),
    l = list(tiny = 'b', 'very small' = 'b', small = 'b', large = 'a')
  ),
  # same new attribute, but with automatically generated names
  list(
    r = 'size',
    l = c(1,2,2,2)
  ),
  # keep original attribute in column 2 of the data
  list(
    r = 2
  ),
  # add the first and second original attributes without recoding
  # redundant here, but it illustrates the possibilities
  list(
    d = c(1,2)
  )
)
recode(short_recoding, data)
```

Note that this `short_recoding` would be really short and simple when written manually in YAML:

```
recoding:
- r: size
  a: newSize
  v: [large, small]
  l: [1,2,2,2]
- r: size
  a: newSize
  v:
  - a: large
  - b: small
  l:
  - tiny: b
  - very small: b
  - small: b
  - large: a
- r: size
  l: [1,2,2,2]
- r: 2
- d: [1,2]
```

To document the recoding, it is preferable to expand all the shortcuts to their full text. This can be done by using `read.recoding`. When `file` is specified here, the result is written to a YAML file that can be easily shared or published as documentation of the recoding.

```{r expand_shortcuts_notrun, eval = FALSE}
read.recoding(short_recoding, file = yourFile, data = data)
```

```{r expand_shortcuts, echo = FALSE}
meta <- list(
  title = NULL,
  author = NULL,
  date = format(Sys.time(),"%Y-%m-%d"),
  originalData = "data"
)
recoding <- read.recoding(short_recoding, file = NULL , data = data)
outfile <- c(meta, list(recoding = recoding))
cat(yaml::as.yaml(outfile))
```
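
As a closing illustration, the idea behind named linking of combined attributes can be sketched in plain base R. This is only a rough approximation under stated assumptions: the recoded toy data is typed in by hand, the helper names `keys` and `combined` are hypothetical, and the treatment of missing values shown here is an assumption, not a description of what `recode` does internally.

```r
# Recoded toy data from the split-attributes example, typed in by hand
size <- c('large', 'small', 'small', 'large', 'small', 'small')
age  <- c('young', 'young', 'old', 'old', NA, 'young')

# Keys for the combined attribute, joined with the ' + ' delimiter;
# a missing age yields the key 'small + NA', which is not in the link
keys <- paste(size, age, sep = ' + ')

# Named values and named link, as in the named-linking example
values <- c(lo = 'largeOld', ly = 'largeYoung', so = 'smallOld', sy = 'smallYoung')
link <- c('small + young' = 'sy', 'small + old' = 'so',
          'large + young' = 'ly', 'large + old' = 'lo')

# Look up by name: any key that is absent from `link` comes out as NA
combined <- unname(values[link[keys]])

# combined now holds: largeYoung, smallYoung, smallOld, largeOld, NA, smallYoung
combined
```

Since every lookup goes through names, neither the order of the entries in `link` nor the order of the factor levels affects the result.
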