String in R
Posted on Nov 20, 2012 in Programming
Things under legendu.net/outdated are outdated technologies that the author does not plan to update any more. Please look for better alternatives.
The R package `stringi` is a great one. It is suggested that you use string functions in the `stringi` package rather than those in the base package (`grep`, `sub`, etc.) when possible.
Functions in the "stringi" Package
stri_trans_totitle
`%s+%` pastes two strings together. Vectorized operation is supported. However, there is one difference between `%s+%` and `paste`: applying `%s+%` to `NA` returns `NA`, while `paste` converts `NA` to the string `"NA"`. In fact, all `stri_*` functions respect `NA` propagation; that is, applying any `stri_*` function to `NA` returns `NA`.
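The difference in `NA` handling can be seen in a minimal sketch like the one below. It assumes the `stringi` package may or may not be installed, so the `stringi` part is guarded with `requireNamespace`; the variable names are only illustrative.

```r
# Base R: paste() turns NA into the literal string "NA".
base_result <- paste("a", NA, sep = "")
print(base_result)

# stringi: %s+% is vectorized and propagates NA.
# The guard lets this sketch run even when stringi is not installed.
if (requireNamespace("stringi", quietly = TRUE)) {
  library(stringi)
  stri_result <- c("a", "b") %s+% c("x", NA)
  print(stri_result)
}
```

Here `base_result` is the string `"aNA"`, while the second element of `stri_result` is a missing value.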
Misc
- R 3.3.0 and above: `validUTF8(x)` checks whether each element of a character vector is valid UTF-8.
- Function `chartr` can be used to translate old characters to new characters, one for one. It cannot drop characters, because `new` must be at least as long as `old`; to delete characters from a string, use `gsub` with an empty replacement instead.
- Function `nchar` can be used to get the number of characters in a string, while function `length` only gives the number of elements of a vector or matrix.
- To use a special character literally, put a backslash before it. Inside a pair of single quotation marks, a double quote needs no escaping (and vice versa), but a backslash still does.
- R function `paste` can concatenate several vectors element-wise, and it can also concatenate the elements of a single vector (with the `collapse` argument).
- Using functions `expression` and `eval`, we can achieve symbolic computation. In addition, function `parse` might be useful.
- Function `match.arg` can be used to partially match strings.
- To replace a substring of a string with a new one, you can use `sub` or `gsub`. The difference is that `sub` only replaces the first occurrence while `gsub` replaces all occurrences. However, sometimes you might want to replace a substring by index, which gives more precise control over the substitution. For example, when there are multiple occurrences of a substring and you only want to replace the 2nd one, neither `sub` nor `gsub` works well. `substr` and `substring` are good alternatives in this situation. These two functions work like indexing on vectors and matrices, which means that you can use them both to extract and to replace substrings. Just as you replace an element of a vector/matrix by assigning a new value to it, you replace a substring with `substr` or `substring` by assigning a new value to it. However, `substr` and `substring` can only replace a substring with one of the same length: if the replacement is shorter than the specified substring, only the leading part of the substring is replaced; if the replacement is too long, it is truncated to the length of the substring to be replaced. If you want to replace a substring specified by index with any new string, you can use `dclong.String::strReplace`.
- `strwidth` calculates the width of a string when displayed on a graphics device.
- `strsplit` splits a string according to specified delimiters. `strsplit` is based on regular expressions by default. You can split on a literal string by specifying the option `fixed = TRUE` (similar to other regular expression functions). For example, you can use the following code to split `NCO_MTG_Per_L1 + LHPIRT_b + URD_b` using the plus sign.
strsplit("NCO_MTG_Per_L1 + LHPIRT_b + URD_b", "+", fixed=TRUE)
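As a small sketch of the literal-versus-regex distinction, the same split can be written either with `fixed = TRUE` or with an escaped plus sign. The variable names are only for illustration, and `trimws` (available since R 3.2.0) is used to remove the spaces left around each piece.

```r
x <- "NCO_MTG_Per_L1 + LHPIRT_b + URD_b"

# Literal split: "+" is taken as a plain character.
parts_fixed <- strsplit(x, "+", fixed = TRUE)[[1]]

# Regex split: "+" is a quantifier, so it has to be escaped.
parts_regex <- strsplit(x, "\\+")[[1]]

# Both approaches give the same pieces, still padded with spaces.
print(identical(parts_fixed, parts_regex))

# trimws() removes the surrounding whitespace from each piece.
terms <- trimws(parts_fixed)
print(terms)
```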
Notice that splitting an empty string results in an empty character vector (`character(0)`), which is not a good behavior. Returning an empty string would be better.
> strsplit(c("", "1:2:3"), ":")
[[1]]
character(0)
[[2]]
[1] "1" "2" "3"
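If none of the strings ends with the delimiter, a trick to resolve the issue is to append an extra delimiter to each string before splitting. `strsplit` ignores a match at the very end of a string, so non-empty strings give the same result as before.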
> strsplit(c(":", "1:2:3"), ":")
[[1]]
[1] ""
[[2]]
[1] "1" "2" "3"
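The note on `sub`, `gsub`, and `substr` earlier in this section can be illustrated with a short sketch. The helper `replace_by_index` is hypothetical: it is a base R stand-in for replacing an indexed substring with a string of any length, not the implementation of `dclong.String::strReplace`.

```r
x <- "a-b-a-b-a"

# sub() replaces only the first match; gsub() replaces every match.
first_only <- sub("a", "X", x)
all_matches <- gsub("a", "X", x)

# Replace only the 2nd occurrence: locate the matches, then assign by index.
pos <- gregexpr("a", x, fixed = TRUE)[[1]]
second_only <- x
substr(second_only, pos[2], pos[2]) <- "X"

# substr<- keeps the string length unchanged.
short <- "abcdef"
substr(short, 2, 4) <- "X"     # replacement too short: only 1 character changes
long <- "abcdef"
substr(long, 2, 4) <- "WXYZ"   # replacement too long: truncated to 3 characters

# Hypothetical helper: replace positions start..stop with any string.
replace_by_index <- function(s, start, stop, replacement) {
  paste0(substr(s, 1, start - 1), replacement, substr(s, stop + 1, nchar(s)))
}
any_length <- replace_by_index("abcdef", 2, 4, "-")

print(c(first_only, all_matches, second_only, short, long, any_length))
```

The last line prints the six results side by side, which makes it easy to compare what each approach changes.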
IO
- By default (before R 4.0.0), character columns read into R by `read.table` (and similar functions) are converted to factors. While this makes modeling convenient in R, it is usually inconvenient if you have to manipulate the data. The problem is nicely solved with the option `stringsAsFactors = FALSE` (or `as.is = TRUE`).
- Function `cat` can be used to display numbers and characters without quotation marks.
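Both points can be tried with a small sketch. The inline CSV text and the column names are made up for illustration, and the data is read with the `text` argument of `read.table` so that no file is needed.

```r
# Made-up data, read from a string instead of a file.
csv_text <- "name,score\nalice,1\nbob,2"

# Keep character columns as character vectors rather than factors.
df <- read.table(text = csv_text, header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)
print(class(df$name))

# as.is = TRUE has the same effect for character columns.
df2 <- read.table(text = csv_text, header = TRUE, sep = ",", as.is = TRUE)
print(class(df2$name))

# print() shows the string with quotation marks; cat() shows it without.
print(df$name[1])
cat(df$name[1], df$score[1], "\n")
```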
Encoding
- Function `iconv` is very useful for converting text between different encoding schemes.
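A minimal sketch of such a conversion is shown below. It builds the Latin-1 bytes of "café" by hand so the example does not depend on the encoding of the source file, and it assumes R 3.3.0 or later for `validUTF8`.

```r
# The bytes of "café" in Latin-1: the last byte, 0xe9, is "é".
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Convert from Latin-1 to UTF-8.
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# The original bytes are not valid UTF-8; the converted string is.
print(validUTF8(latin1_text))
print(validUTF8(utf8_text))

# In UTF-8, "é" takes two bytes (0xc3 0xa9).
print(charToRaw(utf8_text))
```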
Formatting
- Function `format` is very useful for displaying numbers and characters padded to the same width.
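A few illustrative calls are sketched below; the variable names are arbitrary.

```r
# Numbers are padded on the left to a common width.
nums <- format(c(1, 10, 100))
print(nums)

# Strings are padded on the right (left-justified) by default.
words <- format(c("a", "bbb"))
print(words)

# justify = "right" pads strings on the left instead.
right <- format(c("a", "bbb"), justify = "right")
print(right)

# width sets a minimum field width.
wide <- format(42, width = 6)
print(wide)
```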
Regular Expression
- Sometimes we use `\\1` in the replacement to stand for the string matched by the first parenthesized group in the `pattern` argument. `\\11` means the string matched by the first group followed by a literal `1`, even if there are at least 11 groups in the pattern. This means that while regular expressions are convenient for working with strings, it is easy to make mistakes. They are a double-edged sword.
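The backreference behavior can be checked with a short sketch using base R's `sub`; the patterns and inputs are made up for illustration.

```r
# \\1 and \\2 refer to the first and second parenthesized groups.
swapped <- sub("(\\w+)-(\\w+)", "\\2-\\1", "left-right")
print(swapped)

# A pattern with 11 groups: \\11 is still read as \\1 followed by "1".
pattern <- "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)"
result <- sub(pattern, "\\11", "abcdefghijk")
print(result)
```

Here `swapped` is `"right-left"` and `result` is `"a1"`, not `"k"`.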