String in R
Posted on Nov 20, 2012 in Programming
Things under legendu.net/outdated are outdated technologies that the author does not plan to update any more. Please look for better alternatives.
The R package `stringi` is a great one.
It is suggested that you use the string functions in the `stringi` package
rather than those in the base package (`grep`, `sub`, etc.) when possible.
Functions in the "stringi" Package
- `stri_trans_totitle` converts a string to title case.
- `%s+%` concatenates two strings. Vectorized operation is supported. However, there is one difference between `%s+%` and `paste`: applying `%s+%` to `NA` returns `NA`, while `paste` converts `NA` to the literal string `"NA"`. In fact, all `stri_*` functions respect `NA` propagation, that is, applying any `stri_*` function to `NA` returns `NA`.
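The difference in `NA` handling can be checked with a minimal sketch. It assumes the `stringi` package is installed; the guard is only there so that the base R part still runs without it, and the variable names are illustrative.

```r
# Base R: paste0() converts NA to the literal string "NA".
base_result <- paste0("a", NA_character_)   # "aNA"

# stringi: %s+% propagates NA instead.
if (requireNamespace("stringi", quietly = TRUE)) {
  library(stringi)
  stri_result  <- "a" %s+% NA_character_      # NA
  vec_result   <- c("a", "b") %s+% c("x", "y") # element-wise: "ax" "by"
  title_result <- stri_trans_totitle("string in r")
}
```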
Misc
- R 3.3.0 and above: `validUTF8(x)` checks whether each element of `x` is valid UTF-8.
- `chartr` substitutes old characters with new ones, character by character. It cannot drop characters: `old` must not be longer than `new`, so an empty `new` raises an error. To drop characters from a string, use `gsub` with an empty replacement instead.
- `nchar` returns the number of characters in each string, while `length` only returns the number of elements in a vector or matrix.
- To include a special character in a string, put a backslash before it. A double quote can also be written unescaped inside a pair of single quotation marks (and vice versa), but other special characters still need the backslash.
- `paste` concatenates several vectors element by element. With the `collapse` argument it can also concatenate the elements of a single vector.
- Using `expression` and `eval` we can achieve simple symbolic computation. In addition, `parse` might be useful, since it turns a string into an expression.
- `match.arg` can be used to partially match a string against a set of candidate values.
- To replace a substring of a string with a new one, you can use `sub` or `gsub`. The difference is that `sub` only replaces the first occurrence while `gsub` replaces all occurrences. However, sometimes you might want to replace a substring by index, which gives more precise control over the substitution. For example, when there are multiple occurrences of a substring and you only want to replace the 2nd one, neither `sub` nor `gsub` works well. `substr` and `substring` are good alternatives in this situation. These two functions work like indexing on vectors and matrices, which means that you can use them both to extract and to replace substrings: just as you replace an element of a vector or matrix by assigning a new value to it, you replace a substring by assigning a new value to `substr(x, start, stop)`. However, `substr` and `substring` can only replace a substring with one of the same length. If the replacement is not long enough, only part of the specified substring is replaced; if the replacement is too long, it is truncated to the length of the substring to be replaced. If you want to replace a substring specified by index with a new string of any length, you can use `dclong.String::strReplace`.
- `strwidth` calculates the width of a string when displayed on a graphics device.
- `strsplit` splits a string according to the specified delimiters. `strsplit` is based on regular expressions by default. You can match a literal string by specifying the option `fixed = TRUE` (as with other regular expression functions). For example, you can use the following code to split `NCO_MTG_Per_L1 + LHPIRT_b + URD_b` on the plus sign.
strsplit("NCO_MTG_Per_L1 + LHPIRT_b + URD_b", "+", fixed=TRUE)
Notice that splitting an empty string results in an empty character vector (`character(0)`), which is not good behavior. Returning an empty string would be better.
> strsplit(c("", "1:2:3"), ":")
[[1]]
character(0)
[[2]]
[1] "1" "2" "3"
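If none of the strings ends with the delimiter, then a trick to resolve the issue is to add an extra delimiter to each string before splitting. `strsplit` drops the empty piece after a trailing delimiter, so non-empty strings give the same result as before, while an empty string (now just the delimiter) gives an empty string.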
> strsplit(c(":", "1:2:3"), ":")
[[1]]
[1] ""
[[2]]
[1] "1" "2" "3"
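Several of the tips in this section can be checked with a short base R sketch. The input strings and variable names are made up for illustration; the comments show what each call is expected to return.

```r
x <- "banana"

# chartr() maps characters one-to-one; gsub() with "" drops them instead.
mapped  <- chartr("an", "AN", x)        # "bANANA"
dropped <- gsub("[an]", "", x)          # "b"

# nchar() counts characters per element; length() counts elements.
n_chars <- nchar(c("banana", "kiwi"))   # 6 4
n_elems <- length(c("banana", "kiwi"))  # 2

# paste() works element-wise; `collapse` joins the elements of one vector.
pairs  <- paste(c("a", "b"), 1:2, sep = "-")      # "a-1" "b-2"
joined <- paste(c("a", "b", "c"), collapse = "")  # "abc"

# parse() turns text into an expression that eval() can evaluate.
value <- eval(parse(text = "1 + 2"))    # 3

# match.arg() completes an unambiguous prefix of one of the choices.
choice <- match.arg("med", c("mean", "median"))   # "median"

# sub() replaces the first match, gsub() replaces every match.
first_only <- sub("an", "AN", x)        # "bANana"
every      <- gsub("an", "AN", x)       # "bANANA"

# Replace by position: the second "an" occupies characters 4 to 5.
second_only <- x
substr(second_only, 4, 5) <- "AN"       # "banANa"

# A replacement that is too short only overwrites part of the span ...
too_short <- x
substr(too_short, 2, 5) <- "XY"         # "bXYana"

# ... and one that is too long is truncated to the span.
too_long <- x
substr(too_long, 2, 3) <- "WXYZ"        # "bWXana"

# strsplit(): appending the delimiter turns character(0) into "".
s <- c("", "1:2:3")
plain  <- strsplit(s, ":")              # character(0), then "1" "2" "3"
padded <- strsplit(paste0(s, ":"), ":") # "", then "1" "2" "3"
```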
IO
- Before R 4.0.0, character columns read into R by `read.table` (and similar functions) are converted to factors by default. While this makes modeling convenient in R, it is usually inconvenient if you have to manipulate the data. The problem is nicely solved with the option `stringsAsFactors = FALSE` (or `as.is = TRUE`), which became the default in R 4.0.0.
- `cat` can be used to display numbers and characters without quotation marks.
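A minimal sketch of both points, using a small inline table (hypothetical data standing in for a real file) so that it runs without any input file:

```r
# Hypothetical inline data standing in for a real file.
txt <- "name score\nalice 1\nbob 2"

# Keep strings as character vectors (the default from R 4.0.0 on).
df_chr <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
class(df_chr$name)   # "character"

# The pre-4.0.0 default behavior, requested explicitly here.
df_fct <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
class(df_fct$name)   # "factor"

# print() shows the index and the quotes; cat() writes the bare text.
print("hello")       # [1] "hello"
cat("hello", 42, "\n")
```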
Encoding
- `iconv` is very useful for converting strings between different encoding schemes.
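As a rough sketch, a round trip between UTF-8 and Latin-1 might look like the following. The encoding names `"UTF-8"` and `"latin1"` are assumed to be available on the platform (`iconvlist()` shows what is supported), and the sample string is made up.

```r
x <- "caf\u00e9"   # "café", stored as UTF-8

# Convert to Latin-1 and back again.
latin1 <- iconv(x, from = "UTF-8", to = "latin1")
back   <- iconv(latin1, from = "latin1", to = "UTF-8")

# The accented letter takes two bytes in UTF-8 but one in Latin-1.
bytes_utf8   <- nchar(x, type = "bytes")       # 5
bytes_latin1 <- nchar(latin1, type = "bytes")  # 4

# The round trip gives back valid UTF-8 with the same bytes.
validUTF8(back)
identical(charToRaw(back), charToRaw(x))
```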
Formatting
- `format` is very useful for displaying numbers and characters with the same width.
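For instance, a small sketch with made-up values shows how `format` pads to a common width:

```r
# Numbers are right-justified to the width of the widest element.
nums <- format(c(1, 10, 100))      # "  1" " 10" "100"

# Character strings are left-justified by default.
chrs <- format(c("a", "bbb"))      # "a  " "bbb"

# `width` sets a minimum width explicitly.
wide_num <- format(42, width = 6)  # "    42"
wide_chr <- format("a", width = 5) # "a    "
```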
Regular Expression
- Sometimes we use `\\1` to stand for the string matched by the first pair of parentheses in the `pattern` argument. `\\11` stands for the string matched by the first pair of parentheses followed by a literal `1`, even if there are 11 or more pairs of parentheses in the `pattern` argument. This means that while regular expressions are convenient for working with strings, it is easy to make mistakes. It is a double-edged sword.
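The backreference behavior can be checked with a short sketch. The date string and the eleven-group pattern are made up purely for illustration.

```r
# \\1, \\2 and \\3 refer to the first, second and third group.
swapped <- sub("(\\d+)-(\\d+)-(\\d+)", "\\3/\\2/\\1", "2012-11-20")  # "20/11/2012"

# Eleven groups: "\\11" is still group 1 followed by a literal "1".
pattern <- "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)"
result  <- sub(pattern, "\\11", "abcdefghijk")                       # "a1"
```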