String Manipulation in R – turning text into a useful form for analysis

String manipulation is the process of modifying, analyzing, or transforming strings into a form that is useful for analysis. It is a fundamental aspect of programming and is used in applications such as data processing, text analysis, and user input handling.

R has a powerful repertoire of functions for string manipulation, many of which are used in pattern matching. Pattern matching in strings involves searching for a sequence of characters (the pattern) within another sequence of characters (the text). It is a crucial concept in data science, used in applications such as text processing, data mining, and information retrieval. In this article we will look at some of the most frequently used functions for string manipulation:

  • tolower and toupper functions to change the character case
  • nchar function for counting characters
  • Function to split texts
  • Functions used in pattern matching

Let’s get started

Changing case is useful for making input strings consistent. The tolower() function converts text to lowercase letters, while toupper() converts it to uppercase. Both functions are vectorised, that is, they change the case of every element of the given character vector:

tolower(c("Hello", "world"))
[1] "hello" "world"

toupper(c("Hello", "world"))
[1] "HELLO" "WORLD"
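A typical use of case conversion is normalising labels that differ only in capitalisation before counting them. The snippet below is a minimal sketch of that idea; the vector `answers` and its values are made up for illustration.

```r
# Hypothetical survey answers with inconsistent capitalisation
answers <- c("Yes", "yes", "YES", "No", "no")

# Normalise to lower case before comparing or tabulating
clean <- tolower(answers)

# Count each distinct answer after normalisation
counts <- table(clean)
counts
```

Without the tolower() step, table() would treat "Yes", "yes" and "YES" as three different answers.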

Another useful function is nchar(), which simply counts the number of characters of each element of a character vector.

nchar(c("Hello","R","User"))
[1] 5 1 4

To count the number of elements in a character vector, rather than the characters in each element, use length():

length(c("Hello","R","User"))
[1] 3
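Because nchar() returns one count per element, its result can be used to filter a vector by string length. The following sketch assumes a small example vector named `words`; the names are illustrative only.

```r
# Example vector of words
words <- c("Hello", "R", "User")

# length() gives the number of elements; nchar() gives characters per element
n_elements <- length(words)
n_chars <- nchar(words)

# Keep only the elements with more than one character
long_words <- words[n_chars > 1]
long_words
```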

To extract the useful parts of a text, we need to split it and make each part accessible. The strsplit() function splits text at the separators given, as a character vector, in its split argument:

# strings are separated by blank space
strsplit("Siddhart Sahasrabudhe",split = " ")
[[1]]
[1] "Siddhart"     "Sahasrabudhe"

# strings are separated by comma
strsplit("a,bb,ccc",split = ",")
[[1]]
[1] "a"   "bb"  "ccc"

The function returns a list. Each element of the list is a character vector produced by splitting the corresponding element of the original character vector.

students <- strsplit(c("Siddharth,26,physics","Gayatri,23,chemistry"),split=",")
students

[[1]]
[1] "Siddharth" "26"        "physics"  

[[2]]
[1] "Gayatri"   "23"        "chemistry"
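Since the result is a list, the pieces usually need one more step before they can be analysed. One possible approach, sketched below, is to pull out a single field with sapply() or to stack all the pieces into a data frame. The column names `name`, `age` and `subject` are assumptions made for this example.

```r
students <- strsplit(c("Siddharth,26,physics", "Gayatri,23,chemistry"), split = ",")

# Pull the first field (the name) out of every list element
student_names <- sapply(students, function(parts) parts[1])

# Stack the pieces row by row into a character matrix, then build a data frame
student_matrix <- do.call(rbind, students)
student_df <- data.frame(
  name = student_matrix[, 1],
  age = as.integer(student_matrix[, 2]),  # split pieces are text, so convert
  subject = student_matrix[, 3],
  stringsAsFactors = FALSE
)
student_df
```

This layout only works when every string splits into the same number of pieces, as it does here.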

The strsplit() function is more powerful than shown here. It also supports regular expressions, a very powerful framework for processing text data.
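As a small illustration of splitting with a regular expression, the sketch below handles a string whose separators are inconsistent. The input string `messy` is invented for the example.

```r
# Hypothetical record with inconsistent separators (comma, semicolon, extra spaces)
messy <- "a, bb;ccc ,  dddd"

# Split on a comma or semicolon, together with any surrounding whitespace
parts <- strsplit(messy, split = "\\s*[,;]\\s*")[[1]]
parts
```

Here `[,;]` matches either separator and `\\s*` absorbs the whitespace around it, so the pieces come back already trimmed.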

We will cover four important functions for pattern matching and string replacement:

  • grep function – returns the positions (indices) of the elements that match the pattern
  • grepl function – returns a logical (TRUE/FALSE) value for each element, indicating whether it matches
  • sub function – replaces the first match in each string
  • gsub function – replaces all matches in each string

To demonstrate the grep() function, we will use the built-in state.division data set. It is part of the state data sets and gives the division of each US state. Check ?state for more information.

# state.division is a factor, so we convert it to character type for display
head(as.character(state.division))
[1] "East South Central" "Pacific"           
[3] "Mountain"           "West South Central"
[5] "Pacific"            "Mountain"   

The first argument of grep() is the pattern, while the second is the character vector where matches are sought.

grep(pattern = <regex>, x = <character vector>)

We want to find the state divisions that have “North” in their name, so the pattern is “North”. The x argument is state.division, the vector to search; grep() coerces the factor to character before matching.

grep(pattern = "North" , x = state.division)
 [1] 13 14 15 16 22 23 25 27 34 35 41 49

By default, grep() returns the indices of the matching elements.

Use value = TRUE to return the matching elements themselves.

grep(pattern = "North" , x = state.division,value = TRUE)
 [1] "East North Central" "East North Central"
 [3] "West North Central" "West North Central"
 [5] "East North Central" "West North Central"
 [7] "West North Central" "West North Central"
 [9] "West North Central" "East North Central"
[11] "West North Central" "East North Central"
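The indices returned by grep() are most useful for subsetting another vector that is aligned with the one searched. As a sketch, the built-in state.name vector (documented alongside state.division under ?state) can be indexed this way; the variable names `north_idx` and `north_states` are chosen for this example.

```r
# Indices of the states whose division contains "North"
north_idx <- grep(pattern = "North", x = state.division)

# state.name is aligned with state.division, so the same indices select the state names
north_states <- state.name[north_idx]
north_states
```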

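To search a character vector for a pattern and get a logical (TRUE/FALSE) result for each element, use the grepl() function. The pattern “North|South” matches elements containing either “North” or “South”.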

grepl(pattern = "North|South", x = state.division)

[1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[10]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[19] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
[28] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[37] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[46]  TRUE FALSE  TRUE  TRUE FALSE

To count the matches, we can wrap the call in sum(), which counts each TRUE as 1:

sum(grepl(pattern = "North|South", x = state.division))
[1] 28
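A logical vector from grepl() can also serve as a mask for subsetting. The sketch below uses the anchor `^` so that only divisions starting with “West” match; the variable names are chosen for illustration.

```r
# Logical mask: TRUE where the division name starts with "West"
is_west <- grepl(pattern = "^West", x = state.division)

# Use the mask to subset the aligned vector of state names
west_states <- state.name[is_west]

# Count the matching states per division
west_counts <- table(as.character(state.division)[is_west])
west_counts
```

Converting the factor with as.character() before tabulating keeps the unmatched divisions out of the table.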

We have now covered the basics of checking for patterns inside a vector of character strings. R also provides functions to replace these matches with other strings directly.

Let’s look at the sub() function. It takes three main arguments: pattern, replacement, and x. As before, pattern is the regular expression to match and x is the character vector in which matches are sought. The replacement argument gives the string that replaces each match.

The sub() function replaces only the first occurrence of the pattern in each string.

# sub function template
sub(pattern = <regex>, replacement = <str> , x = <str>)

new <- c("New York","new new York","New New New York")
new
[1] "New York"         "new new York"     "New New New York"

# default is case sensitive.  
sub("New",replacement = "Old",new)
[1] "Old York"         "new new York"     "Old New New York"

Set ignore.case = TRUE to make the match case-insensitive.

sub("New",replacement = "Old",new,ignore.case = TRUE)
[1] "Old York"         "Old new York"     "Old New New York"

Notice that the second string, “new new York”, has now become “Old new York”. Its first “new” matches because we ignored case.

To replace all occurrences of the pattern in each string, use gsub():

gsub("New",replacement = "Old",new)
[1] "Old York"         "new new York"     "Old Old Old York"

As with sub(), set ignore.case = TRUE to ignore case.

gsub("New",replacement = "Old",new,ignore.case=TRUE)
[1] "Old York"         "Old Old York"     "Old Old Old York"
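Because the pattern is a regular expression, gsub() is also handy for cleaning up text rather than swapping fixed words. The following is a minimal sketch that tidies stray whitespace; the input vector `raw` is invented for the example.

```r
# Hypothetical messy input with stray whitespace
raw <- c("  New   York ", "Los    Angeles")

# Collapse every run of whitespace into a single space (all matches, so gsub)
squeezed <- gsub(pattern = "\\s+", replacement = " ", x = raw)

# Remove whitespace left at the start or end of each string
cleaned <- gsub(pattern = "^\\s+|\\s+$", replacement = "", x = squeezed)
cleaned
```

Using sub() in the first step would collapse only the first run of whitespace in each string, which is why gsub() is the better fit here.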

In this article, we learned:

  • How to perform text transformations such as changing case, counting the number of characters in a string, splitting text, and replacing one string with another
  • How to use pattern matching to check for a specific sequence of characters
