Input and Output in R
Posted on Nov 09, 2012 in Programming
Things under legendu.net/outdated are outdated technologies that the author does not plan to update any more. Please look for better alternatives.
read.table, read.csv, read.delim, etc.
- It is suggested that you use `read.csv`/`read.delim` and avoid `read.table`. The reason is that `comment.char` is off for `read.csv`/`read.delim` but on for `read.table` by default. In my experience, data seldom contain comments, and people are usually not aware that `comment.char` is on for `read.table` by default. When the data contain "comment characters" that are not intended as comment characters, things go wrong if you read the data using `read.table` and forget to turn comments off with `comment.char = ''` (a small sketch follows this list).
- By default, lines starting with `#` are treated as comment lines and are thus ignored when reading data with `read.table`, etc. If a line is not a comment line, avoid starting it with `#`. Alternatively, you can specify a different comment character (e.g., `$` using the option `comment.char = "$"`).
- It is suggested that you always use `stringsAsFactors = FALSE` when constructing a data frame (`data.frame`, `as.data.frame`, `read.table`, `read.csv`, etc.). Factors cause more trouble than convenience. If you do need factors (e.g., for building models), just convert the relevant columns to factors manually (see the sketch after this list).
- If a data row has an extra column/field (e.g., due to missing quotes in CSV format), R will not throw an error but instead treats the data as having a column of row names. This results in shifted columns in some (shorter) rows. You can examine the format of a data file using the function `count.fields`, which counts the number of fields in each line (a sketch follows this list).
- `read.csv` is for standard CSV files, while `read.csv2` is for the European CSV format (semicolon-separated fields with a comma as the decimal mark).
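As a quick illustration of the comment pitfall in the first item above, here is a minimal sketch; the file `data.txt` and its contents are made up for illustration:

```r
# a toy file in which "#" appears in the data but is not meant as a comment
writeLines(c("id\tnote", "1\tgood", "2\titem #2 of 3"), "data.txt")

# read.table has comment.char = "#" by default, so everything after "#"
# on a line is silently dropped
read.table("data.txt", header = TRUE, sep = "\t")

# read.delim turns comments off by default; the equivalent for read.table
# is to pass comment.char = ""
read.delim("data.txt")
read.table("data.txt", header = TRUE, sep = "\t", comment.char = "")
```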
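A minimal sketch of the `stringsAsFactors` suggestion above, assuming the historical default `stringsAsFactors = TRUE` (the default changed to `FALSE` in R 4.0.0):

```r
df1 <- data.frame(fruit = c("apple", "pear"), price = c(1.2, 3.4))
class(df1$fruit)   # "factor" under the old default, which often surprises people

df2 <- data.frame(fruit = c("apple", "pear"), price = c(1.2, 3.4),
                  stringsAsFactors = FALSE)
class(df2$fruit)   # "character"

# convert manually only when a factor is really needed, e.g. for modeling
df2$fruit <- factor(df2$fruit)
```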
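And a small sketch of using `count.fields` to spot rows with the wrong number of fields; the file `bad.csv` is made up:

```r
# the second data row has an extra field
writeLines(c("x,y", "1,2", "3,4,5"), "bad.csv")

n <- count.fields("bad.csv", sep = ",")
n                 # 2 2 3
which(n != n[1])  # lines whose field count differs from the first line
```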
write.table, write.csv, write.delim, etc.
- It is suggested that you write data frames in CSV format. CSV has a clear definition: if you tell people that a file is in CSV format, they know how to read it. But if you tell people that a file is in some general delimited format, they have to ask what the delimiter is, whether fields are quoted, and so on.
- The function `write.table` writes a data frame or a matrix into a file. Note that it can also append data to an existing file.
x = matrix(1:24, nrow = 6)
colnames(x) = paste0("x", 1:4)
# write the matrix to 'out.txt' (column names are written by default)
write.table(x, 'out.txt')
x[, ] = 0
# append the new values; col.names = FALSE avoids writing the header a second time
write.table(x, 'out.txt', append = TRUE, col.names = FALSE)
- By default, `write.table` writes missing values as the character string NA (`na = "NA"`). It is suggested that you output missing values as empty strings (`na = ""`) instead, as this is more portable: other programming languages do not recognize the string "NA" as a missing value (see the sketch after this list).
- Always quote fields when you write data in CSV format.
- `row.names` has different meanings in `read.table` and `write.table`. In `write.table`, `row.names` is a logical value indicating whether you want to write out the row names; in `read.table`, however, it is not a logical value indicating whether the data contain row names. It is much more complicated; please refer to the R help documentation for a detailed explanation.
- It is suggested that you never write row names into files. If the row names contain useful information, write them into the file as a regular column/field. The reason is that row names cause trouble in I/O: on one hand, it is not convenient to read a file with row names into other programming languages; on the other hand, row names cause side effects in R too (a sketch follows this list).
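A minimal sketch of the writing suggestions above (quote fields, write missing values as empty strings, and do not write row names); the file name is made up:

```r
df <- data.frame(name = c("a,b", NA), value = c(1, NA),
                 stringsAsFactors = FALSE)

# write.csv quotes character fields by default; na = "" writes missing
# values as empty strings instead of the string "NA"
write.csv(df, "out.csv", na = "", row.names = FALSE)
```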
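And a sketch of moving row names into an ordinary column before writing, as suggested in the last item above (all names are made up):

```r
m <- data.frame(x = 1:3, y = 4:6, row.names = c("a", "b", "c"))

# keep the information from the row names as a regular column
m$id <- rownames(m)
write.csv(m, "rows.csv", row.names = FALSE)

# reading the file back, no column is silently interpreted as row names
read.csv("rows.csv", stringsAsFactors = FALSE)
```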
MS Excel
http://www.thertrader.com/2014/02/11/a-million-ways-to-connect-r-and-excel/

It is suggested that you avoid using Excel as an input/output data format; CSV is a better alternative. However, if you do have to use Excel as an input/output data format, read the following tips.
- There are lots of ways to read data from and write data to Excel documents. For example, the packages `xlsReadWrite`, `xlsx`, and `RODBC` (among many others) all offer ways to import data from and export data to Excel documents. `xlsx` (which offers `read.xlsx` and `write.xlsx`) is a good package for dealing with Excel spreadsheets (a sketch follows this list). `RODBC` is a universal way to deal with all kinds of databases (not just Excel spreadsheets).
- It is usually very slow to read from (or write to) a large Excel spreadsheet. It is suggested that you convert large Excel spreadsheets to CSV files first and then read those in; likewise, write data into CSV files instead of Excel spreadsheets. However, be careful: Excel might lose information when converting to CSV format. Generally speaking, this happens when there are very long numbers.
- Excel might silently reformat opened CSV (or imported text) files. Generally speaking, this happens when there are very long numbers in the text file. Be careful not to introduce undesired changes. If you just open a CSV file to view it and do not want to change its content, discard any changes Excel has made. If you really do want to edit a CSV file in Excel, keep your fingers crossed.
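A minimal sketch using the `xlsx` package mentioned above (the file and sheet names are made up; the package needs Java installed):

```r
library(xlsx)

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# write a data frame to a spreadsheet, then read it back
write.xlsx(df, "demo.xlsx", sheetName = "Sheet1", row.names = FALSE)
df2 <- read.xlsx("demo.xlsx", sheetIndex = 1)
```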
Binary Data
- The function `readBin` reads binary data from a file and the function `writeBin` writes binary data to a file. Both functions have limits on the size of data they can handle at a time; for large binary data, you have to call `readBin`/`writeBin` multiple times to read/write the data from/to files (a basic sketch follows). `readBin` and `writeBin` are compatible with `fwrite` and `fread` in MATLAB. `fread` and `fwrite` are C-style functions ... (compatibility with C/C++ is still to be checked). However, `readBin` and `writeBin` are not compatible with the Java classes `DataOutputStream` and `DataInputStream`. To read binary data written by `DataOutputStream` in Java, you can call Java code for reading binary data via `rJava` in R; to write binary data that `DataInputStream` in Java recognizes, you can call Java code for writing binary data via `rJava` in R.
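A minimal sketch of writing a numeric vector with `writeBin` and reading it back with `readBin`; the file name is made up:

```r
x <- as.double(1:10)

# write the vector as 8-byte doubles
con <- file("x.bin", "wb")
writeBin(x, con)
close(con)

# read it back; n is an upper bound on the number of values to read,
# so for larger data you can call readBin repeatedly with a chunk size
con <- file("x.bin", "rb")
y <- readBin(con, what = double(), n = length(x))
close(con)

identical(x, y)  # TRUE
```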
Misc
- Almost all input/output functions in R support reading data from all kinds of sources, including files, the console, the clipboard, and websites. For example, if you have copied a block of data from an Excel document, you can read it into R using the following command.
x = read.table('clipboard', sep = '\t')
This is a good and quick way to import part of the data from an Excel document into R. You do not have to worry about formulas in cells: when you read data from Excel this way, values (instead of formulas) are read in. If the data are on a website, you can just pass the URL to `read.table` (or `read.delim`, `read.csv`, etc.) to read them.
- `print` prints an R object to the R console, while `cat` can print multiple objects to the R console, the clipboard, or files. There are some differences between the outputs generated by these two functions. Generally speaking, if you want to see the content of an object as R represents it (special characters kept as they are), use `print`; if you want to see the content in human-readable form (special characters translated), use `cat`. I have to mention that `cat` does not work for all types of R objects (e.g., objects of class `xtable`); a small sketch follows this list.
- The function `scan` is extremely powerful and flexible. You can skip the first `k` lines using the option `skip = k`; for example, `scan(what = double(), skip = 5)` skips the first 5 lines. You can skip lines starting with a special character using the option `comment.char`; for example, `scan(what = double(), comment.char = "#")` skips lines starting with "#". You can decide which strings are treated as missing values using the option `na.strings`; for example, `scan(what = double(), na.strings = "NA")` treats "NA" as a missing value. You can also limit the number of data values/lines to be read in, and so on. Many input functions in R (for example, `read.table`) are based on `scan` and inherit its power and flexibility. `scan` is also helpful for interactive programming; for example, you can read a string into the variable `input` using the command `scan(what = character(), n = 1) -> input`. However, `scan` reads data into a vector, so all the values read in at the same time must be of the same type (a small sketch also follows this list).
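A minimal sketch of the `print` vs `cat` difference discussed above:

```r
s <- "a\tb\nc"

print(s)   # shows "a\tb\nc" with the special characters escaped
cat(s)     # prints the tab and the newline literally
cat("\n")

# cat can also take several objects and write to a file
cat("x =", 1, "\n", file = "log.txt")
```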
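And a small sketch of `scan` with the options mentioned above (skipping comment lines and treating "NA" as missing); the file and its contents are made up:

```r
writeLines(c("# toy data", "1 2 NA", "4 5 6"), "nums.txt")

# read all values as doubles, skip lines starting with "#",
# and treat the string "NA" as a missing value
x <- scan("nums.txt", what = double(), comment.char = "#", na.strings = "NA")
x  # 1 2 NA 4 5 6
```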