String Manipulation in R – turning text into a useful form for analysis

String manipulation is the process of modifying, analyzing, or transforming strings into a form that is useful for analysis. It is a fundamental part of programming, used in applications such as data processing, text analysis, and user input handling.

R has a powerful repertoire of functions for string manipulation, many of which are used for pattern matching. Pattern matching involves checking for a sequence of characters (the pattern) within another sequence of characters (the text). It is a crucial concept in data science, used in text processing, data mining, and information retrieval. In this article we will look at some of the most frequently used functions for string manipulation:

tolower and toupper functions to change the character case
nchar function for counting characters
Function to split texts
Functions used in pattern matching

Let’s get started

Changing cases

Changing case is useful for making input strings consistent. The tolower() function converts text to lowercase letters, while toupper() does the opposite. Both functions are vectorised, that is, they change the case of each element of the given character vector:

tolower(c("Hello", "world"))
[1] "hello" "world"

toupper(c("Hello", "world"))
[1] "HELLO" "WORLD"
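
A common reason to normalise case is so that comparisons do not miss values that differ only in capitalisation. The sketch below illustrates this with a made-up vector named answers; the variable names and data are assumptions for illustration only.

```r
# Hypothetical survey answers with inconsistent capitalisation
answers <- c("Yes", "YES", "no", "yes", "No")

# Compared as-is, only the exact lowercase "yes" matches
answers == "yes"
# [1] FALSE FALSE FALSE  TRUE FALSE

# Normalise to lowercase first, then compare
normalised <- tolower(answers)
normalised == "yes"
# [1]  TRUE  TRUE FALSE  TRUE FALSE

# Count each distinct answer after normalisation
table(normalised)
```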

Counting characters

Another useful function is nchar(), which simply counts the number of characters of each element of a character vector.

nchar(c("Hello","R","User"))
[1] 5 1 4

To count the number of elements in a character vector, rather than the characters in each string, use length():

length(c("Hello","R","User"))
[1] 3
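
Because nchar() returns one number per element, its result can be used to filter a vector by string length. The following is a small sketch with a made-up vector named words; note that spaces count as characters.

```r
# Hypothetical vector of words and phrases
words <- c("Hello", "R", "User", "data science")

length(words)   # number of elements in the vector: 4
nchar(words)    # characters per element, spaces included: 5 1 4 12

# Keep only the elements that are longer than three characters
long_words <- words[nchar(words) > 3]
long_words
# [1] "Hello"        "User"         "data science"
```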

Splitting texts

To extract the useful parts of a text, we need to split it and make each part accessible. The strsplit() function splits text at a specific separator, passed as a character string in the split argument:

# strings are separated by blank space
strsplit("Siddhart Sahasrabudhe",split = " ")
[[1]]
[1] "Siddhart"     "Sahasrabudhe"

# strings are separated by comma
strsplit("a,bb,ccc",split = ",")
[[1]]
[1] "a"   "bb"  "ccc"

The function returns a list. Each element of the list is a character vector produced by splitting the corresponding element of the original character vector.

students <- strsplit(c("Siddharth,26,physics","Gayatri,23,chemistry"),split=",")
students

[[1]]
[1] "Siddharth" "26"        "physics"  

[[2]]
[1] "Gayatri"   "23"        "chemistry"

The strsplit() function is more powerful than shown here. It also supports regular expressions, a very powerful framework for processing text data.
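
As a rough sketch of how the list returned by strsplit() might be used, the example below pulls individual fields out of the student records with sapply() and then splits a string on a regular expression. The names student_names, ages, and mixed, and the string "a, bb;ccc", are illustrative assumptions.

```r
# Same student records as above
students <- strsplit(c("Siddharth,26,physics", "Gayatri,23,chemistry"), split = ",")

# Pull the first field (the name) out of every list element
student_names <- sapply(students, function(fields) fields[1])
student_names
# [1] "Siddharth" "Gayatri"

# Pull the second field and convert it from text to numbers
ages <- as.numeric(sapply(students, `[`, 2))
ages
# [1] 26 23

# split is treated as a regular expression by default:
# "[,;] *" matches a comma or a semicolon followed by optional spaces
mixed <- strsplit("a, bb;ccc", split = "[,;] *")
mixed[[1]]
# [1] "a"   "bb"  "ccc"
```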

Pattern Matching and Replacement

We will cover four important functions for pattern matching and string replacement:

grep function – returns the positions (indices) of the elements that match the pattern
grepl function – returns a logical (TRUE/FALSE) value for each element, indicating whether it matches
sub function – replaces the first match in each string
gsub function – replaces all matches in each string

grep function

To demonstrate the grep() function, we will use the built-in state.division data set. It is part of the state metadata and gives the division of each US state. Check ?state for more information.

# the data is stored as a factor, so we convert it to character type
head(as.character(state.division))
[1] "East South Central" "Pacific"           
[3] "Mountain"           "West South Central"
[5] "Pacific"            "Mountain"

The first argument of grep() is the pattern, while the second is the character vector where matches are sought.

grep(pattern = <regex> , x = <string>)

We want to find the state divisions that have “North” in their name, so the pattern is “North”. The x argument is state.division, the vector in which matches are sought.

grep(pattern = "North" , x = state.division)
 [1] 13 14 15 16 22 23 25 27 34 35 41 49

By default, grep() returns the indices of the matching elements.

Use value=TRUE to return the matching elements themselves.

grep(pattern = "North" , x = state.division,value = TRUE)
 [1] "East North Central" "East North Central"
 [3] "West North Central" "West North Central"
 [5] "East North Central" "West North Central"
 [7] "West North Central" "West North Central"
 [9] "West North Central" "East North Central"
 [11] "West North Central" "East North Central"
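
The indices returned by grep() are most useful for subsetting another vector of the same length. As a sketch, the built-in state.name vector lists the states in the same order as state.division, so the indices can be used to look up which states belong to a “North” division. The variable names north_idx and north_states are illustrative.

```r
# Work on a plain character vector rather than the factor
divisions <- as.character(state.division)

# Positions of the divisions whose name contains "North"
north_idx <- grep(pattern = "North", x = divisions)

# state.name is in the same order, so the same positions pick out the states
north_states <- state.name[north_idx]
north_states

# The distinct division names that matched
unique(grep(pattern = "North", x = divisions, value = TRUE))
# [1] "East North Central" "West North Central"
```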

grepl function

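To search for a pattern in a character vector and get logical (TRUE/FALSE) output, use the grepl() function. The pattern below uses |, the regular expression operator for alternation, to match either “North” or “South”.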

grepl(pattern = "North|South", x = state.division)

[1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[10]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[19] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
[28] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[37] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[46]  TRUE FALSE  TRUE  TRUE FALSE

To count the matches, we can wrap the call in sum():

sum(grepl(pattern = "North|South", x = state.division))
[1] 28
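
A logical vector from grepl() can also be used directly as a filter, and it can be negated with ! to select the elements that do not match. The following sketch assumes the built-in state.name vector as the thing to filter; has_direction is an illustrative name.

```r
divisions <- as.character(state.division)

# TRUE where the division name contains "North" or "South"
has_direction <- grepl(pattern = "North|South", x = divisions)

# States whose division name contains neither word
state.name[!has_direction]

# Share of states that matched: 28 out of 50
mean(has_direction)
# [1] 0.56
```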

sub function

We have now covered some basics of how to check for patterns inside a vector of character strings. R also provides functions to replace these matches directly with other strings.

Let’s look at the sub function. It takes three main arguments: pattern, replacement, and x. As before, pattern is the regular expression to match, and x is the character vector in which matches are sought. The replacement argument is the string that replaces each match.

The sub() function replaces only the first occurrence of the pattern in each string.

# sub function template
sub(pattern = <regex>, replacement = <str> , x = <str>)

new <- c("New York","new new York","New New New York")
new
[1] "New York"         "new new York"     "New New New York"

# default is case sensitive.  
sub("New",replacement = "Old",new)
[1] "Old York"         "new new York"     "Old New New York"

Use ignore.case=TRUE to ignore the case.

sub("New",replacement = "Old",new,ignore.case = TRUE)
[1] "Old York"         "Old new York"     "Old New New York"

Notice that the second string, “new new York”, has now become “Old new York”. It matches because we ignored case.
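
Because the pattern is a regular expression, characters such as . have a special meaning (a dot matches any single character). The sketch below, using a made-up version string, shows two ways to match a literal dot instead: the fixed = TRUE argument, or escaping the dot with a double backslash.

```r
# Hypothetical version string
version <- "1.2.3"

# "." as a regular expression matches any character, so the "1" is replaced
as_regex <- sub(pattern = ".", replacement = "-", x = version)
as_regex
# [1] "-.2.3"

# fixed = TRUE treats the pattern as plain text, so the first dot is replaced
as_literal <- sub(pattern = ".", replacement = "-", x = version, fixed = TRUE)
as_literal
# [1] "1-2.3"

# Escaping the dot has the same effect
escaped <- sub(pattern = "\\.", replacement = "-", x = version)
escaped
# [1] "1-2.3"
```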

gsub function

To replace all matching occurrences of a pattern, use gsub():

gsub("New",replacement = "Old",new)
[1] "Old York"         "new new York"     "Old Old Old York"

Use ignore.case=TRUE to ignore the case.

gsub("New",replacement = "Old",new,ignore.case=TRUE)
[1] "Old York"         "Old Old York"     "Old Old Old York"
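
A typical use of gsub() with a regular expression is cleaning up irregular spacing. This is a minimal sketch with made-up input; trimws() is a base R helper, not covered above, that removes leading and trailing whitespace.

```r
# Hypothetical input with irregular spacing
messy <- c("  New   York ", "Los    Angeles")

# "\\s+" matches one or more whitespace characters;
# every such run is collapsed into a single space
collapsed <- gsub(pattern = "\\s+", replacement = " ", x = messy)

# Remove the remaining space at the start and end of each string
cleaned <- trimws(collapsed)
cleaned
# [1] "New York"    "Los Angeles"
```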

Summary

In this article, we learned:

How to perform text transformations such as changing to lowercase or uppercase letters, counting the number of characters in a string, splitting a string, and replacing part of a string with another string
How to match patterns to check for a specific sequence of characters