String manipulation is the process of modifying, analyzing, or transforming strings into a form that is useful for analysis. It is a fundamental part of programming and is used in applications such as data processing, text analysis, and user input handling.
R has a powerful repertoire of functions for string manipulation, many of which are used in pattern matching. Pattern matching involves searching for a sequence of characters (the pattern) within another sequence of characters (the text). It is a crucial concept in data science, with applications in text processing, data mining, and information retrieval. In this article we will look at some of the most frequently used functions for string manipulation:
- tolower and toupper functions to change the character case
- nchar function for counting characters
- Function to split texts
- Functions used in pattern matching
Let’s get started
Changing cases
Changing case is useful for making sure that input strings are consistent. The tolower() function converts text to lowercase letters, while toupper() does the opposite. Both functions are vectorised, that is, they change the case of each element of the given character vector:
tolower(c("Hello", "world"))
[1] "hello" "world"
toupper(c("Hello", "world"))
[1] "HELLO" "WORLD"
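A common use of case conversion is comparing text without regard to capitalisation. The sketch below assumes a made-up vector of survey answers (answers is a hypothetical name, not part of any data set) and normalises it before comparing:

```r
# Hypothetical user input with inconsistent capitalisation
answers <- c("Yes", "YES", "no", "yEs")

# Convert everything to lowercase first, then compare against one spelling
is_yes <- tolower(answers) == "yes"
is_yes
# [1]  TRUE  TRUE FALSE  TRUE
```

Without the tolower() call, only an exact "yes" would match, and none of the three affirmative answers above would be counted.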
Counting characters
Another useful function is nchar(), which simply counts the number of characters in each element of a character vector:
nchar(c("Hello","R","User"))
[1] 5 1 4
To count the number of elements in a character vector, rather than the characters in each string, use length():
length(c("Hello","R","User"))
[1] 3
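Because nchar() returns one count per element, its result can be used directly to filter a vector. The following is a minimal sketch using the same three words; the variable names are chosen only for illustration:

```r
words <- c("Hello", "R", "User")

n_chars <- nchar(words)   # one count per element
n_elems <- length(words)  # a single number for the whole vector

# Keep only the elements that have more than one character
long_words <- words[nchar(words) > 1]
long_words
# [1] "Hello" "User"
```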
Splitting texts
To extract the useful parts of a text, we need to split it and make each part accessible. The strsplit() function splits text at a specific separator, given as a character vector:
# strings are separated by blank space
strsplit("Siddhart Sahasrabudhe",split = " ")
[[1]]
[1] "Siddhart" "Sahasrabudhe"
# strings are separated by comma
strsplit("a,bb,ccc",split = ",")
[[1]]
[1] "a" "bb" "ccc"
The function returns a list. Each element of the list is a character vector produced by splitting the corresponding element of the original character vector.
students <- strsplit(c("Siddharth,26,physics","Gayatri,23,chemistry"),split=",")
students
[[1]]
[1] "Siddharth" "26" "physics"
[[2]]
[1] "Gayatri" "23" "chemistry"
The strsplit() function is more powerful than shown here. It also supports regular expressions, a very powerful framework for processing text data.
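To illustrate both points, the sketch below first splits on a regular expression and then pulls individual fields out of the list that strsplit() returns. The input string with mixed separators is invented for the example, and using sapply() with "[" is just one possible way to extract fields:

```r
# Split on a comma or semicolon, together with any surrounding spaces
parts <- strsplit("a, bb;ccc", split = "\\s*[,;]\\s*")
parts[[1]]
# [1] "a"   "bb"  "ccc"

# Extract one field from every element of the list
students <- strsplit(c("Siddharth,26,physics", "Gayatri,23,chemistry"), split = ",")
student_names <- sapply(students, "[", 1)             # first field of each record
student_ages  <- as.numeric(sapply(students, "[", 2)) # second field, as numbers
student_names
# [1] "Siddharth" "Gayatri"
```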
Pattern Matching and Replacement
We will cover four important functions for pattern matching and string replacement:
- grep function: returns the positions of the elements that match the pattern
- grepl function: returns a logical (TRUE/FALSE) value for each element
- sub function: replaces the first match in each string
- gsub function: replaces all matches in each string
grep function
To demonstrate the grep() function, we will use the built-in state.division data set. It is part of the state data sets and gives the division of each US state. Check ?state for more information.
# the data is stored as a factor, so we convert it to character type
head(as.character(state.division))
[1] "East South Central" "Pacific"
[3] "Mountain" "West South Central"
[5] "Pacific" "Mountain"
The first argument of grep() is the pattern, while the second is the character vector in which matches are sought.
grep(pattern = <regex>, x = <string>)
We want to find the state divisions that have “North” in their name, so the pattern is “North”. The x argument is state.division, the vector to search.
grep(pattern = "North" , x = state.division)
[1] 13 14 15 16 22 23 25 27 34 35 41 49
By default, grep() returns the indices of the matching elements. Use value = TRUE to return the matching elements themselves.
grep(pattern = "North" , x = state.division,value = TRUE)
[1] "East North Central" "East North Central"
[3] "West North Central" "West North Central"
[5] "East North Central" "West North Central"
[7] "West North Central" "West North Central"
[9] "West North Central" "East North Central"
[11] "West North Central" "East North Central"
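The indices returned by grep() are useful for subsetting a second vector that has the same order. As a sketch, the code below assumes the built-in state.name vector, which lists the states in the same order as state.division:

```r
# Positions of the divisions whose name contains "North"
idx <- grep(pattern = "North", x = state.division)

# Use the same positions to look up the matching state names
north_states <- state.name[idx]
head(north_states, 3)
# [1] "Illinois" "Indiana"  "Iowa"
```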
grepl function
To search for a pattern in a character vector and get logical (TRUE/FALSE) output, use the grepl() function.
grepl(pattern = "North|South", x = state.division)
[1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
[10] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[19] FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
[28] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[37] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
[46] TRUE FALSE TRUE TRUE FALSE
To count the matches, we can wrap the call in sum():
sum(grepl(pattern = "North|South", x = state.division))
[1] 28
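Because grepl() returns one logical value per element, the result can also be negated and used for subsetting. A minimal sketch that lists the divisions matching neither “North” nor “South” (the variable names are illustrative):

```r
has_ns <- grepl(pattern = "North|South", x = state.division)

# Negate the logical vector to keep the elements that did NOT match
other_divisions <- sort(unique(as.character(state.division[!has_ns])))
other_divisions
# [1] "Middle Atlantic" "Mountain"        "New England"     "Pacific"

sum(!has_ns)
# [1] 22
```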
sub function
We have now covered the basics of checking for patterns inside a vector of character strings. R also provides functions to replace these matches with other strings directly.
Let’s look at the sub() function. It takes three main arguments: pattern, replacement, and x. As before, pattern is the regular expression you want to match, and x is the character vector in which matches are sought. The replacement argument is the string that replaces each match.
The sub() function replaces only the first occurrence of the pattern in each string.
# sub function template
sub(pattern = <regex>, replacement = <str> , x = <str>)
new <- c("New York","new new York","New New New York")
new
[1] "New York" "new new York" "New New New York"
# default is case sensitive.
sub("New",replacement = "Old",new)
[1] "Old York" "new new York" "Old New New York"
Use ignore.case = TRUE to ignore case.
sub("New",replacement = "Old",new,ignore.case = TRUE)
[1] "Old York" "Old new York" "Old New New York"
Notice that the second string, “new new York”, has now become “Old new York”. It matches because we ignored case.
gsub function
To replace all matching occurrences of a pattern, use gsub():
gsub("New",replacement = "Old",new)
[1] "Old York" "new new York" "Old Old Old York"
Use ignore.case=TRUE to ignore the case.
gsub("New",replacement = "Old",new,ignore.case=TRUE)
[1] "Old York" "Old Old York" "Old Old Old York"
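Since the pattern is a regular expression, gsub() is also handy for cleaning up text. The sketch below uses an invented vector (messy) and shows one possible way to tidy whitespace, first collapsing repeated spaces and then removing a space at the start or end of each string:

```r
# Hypothetical input with irregular spacing
messy <- c("  New   York ", "Los    Angeles")

# Step 1: collapse every run of whitespace into a single space
clean <- gsub("\\s+", " ", messy)

# Step 2: remove a single leading or trailing space
clean <- gsub("^ | $", "", clean)
clean
# [1] "New York"    "Los Angeles"
```

Using sub() in step 1 would collapse only the first run of spaces in each string, which is why gsub() is the better fit here.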
Summary
In this article, we learned:
- How to perform text transformations such as changing to lowercase or uppercase letters, counting the number of characters in a string, and replacing one string with another
- How to match patterns to check for a specific sequence of characters