Economics should be open

August 14, 2009

Octave cell-arrays are pretty slow

Filed under: coding — howardchong @ 5:50 pm

I’m trying to figure out which open source statistical/computation package to use.  I used to use Matlab. It’s good, but expensive, and it has WAY more features than I need.

I know I should be running things on Unix, but right now I’m on Windows XP. I sometimes putty into a Unix server and run things.

R looks very good. That’s my next language to learn.

Octave is pretty good. It provides syntax almost identical to Matlab.  In 3.0, it now has support for Multidimensional Cell Arrays. These are arrays that can hold any data type. Most common for me is an array of strings. If you load data that is mixed text and numeric, then your data will probably be read as a cell-array.

One thing I have noticed is that the cell-arrays are really quite slow.

I had a ~10000 x 10 csv file.

Column 1 mixed numeric values and strings. They were 6-character codes, and about 2/3 of them did not have alphabetical characters. I needed to convert them all to strings, and then do a sort and some other processing. I basically had to traverse each element of the first column and do the datatype change individually.

The process was VERY slow. In fact, I think Excel would be better at such tasks.

Here are a few tips:

  • If you can, remove all strings from your CSV file.
  • If you read a large dataset as one large cell array, separate each column into its own variable. Then pack together the numeric data into a matrix (if needed).
  • STATA has an “encode” routine that converts strings into records stored as numeric. For example, if your data range is car makes, it will give each make a number and then also generate a lookup table where you can decipher what the numbers mean.
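
The encode idea in the last tip is easy to reproduce in other tools. Here is a minimal Python sketch of it, not Stata's actual implementation: the function name encode_labels and the sample car makes are made up for illustration, and I'm assuming codes are handed out in sorted order starting at 1.

```python
def encode_labels(values):
    """Map each distinct string to an integer code (1, 2, 3, ...).

    Codes are assigned in sorted order of the distinct strings.
    Returns (codes, lookup), where lookup maps code -> original string.
    """
    labels = sorted(set(values))
    code_of = {label: code for code, label in enumerate(labels, start=1)}
    codes = [code_of[v] for v in values]
    lookup = {code: label for label, code in code_of.items()}
    return codes, lookup


# Hypothetical sample data: a column of car makes.
makes = ["Toyota", "Ford", "Toyota", "Honda"]
codes, lookup = encode_labels(makes)
print(codes)   # numeric column you can store in a plain matrix
print(lookup)  # table for deciphering the numbers later
```

The numeric codes can live in an ordinary matrix, and the lookup table is only consulted when you need the text back.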

Also check out this page that benchmarks the math/science packages with a set of standard routines:

http://www.sciviews.org/benchmark/index.html

March 26, 2009

Sample regular expressions for stata

Filed under: coding, Stata — howardchong @ 10:22 pm

Just a note to show how to use regular expressions in stata for text processing.

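PROBLEM: I had a lot of codes in variable dsmnem that had “:” and “.” characters. I wanted to reshape my data and use these strings as the j variable, i.e. “reshape … j(dsmnem) string”. If that is a reshape wide, the j values end up in the new variable names, which can’t contain colons or periods.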
SOLUTION: regular expressions

replace dsmnem=lower(regexr(dsmnem,":","_"))
replace dsmnem=regexr(dsmnem,"\.","")

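These two lines replace colons with underscores (lowercasing the result) and periods with empty text. Note that I have to use the escape character to specify the period character; otherwise the period has a special meaning in the regular expression (it matches any character). Also note that regexr only replaces the first match, so a code with several colons or periods needs the line run more than once.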
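
For comparison, the same cleanup can be sketched in Python. This is only an illustration: clean_code and the sample code "US:IBM.N" are made up, and Python's re.sub replaces every match in one pass.

```python
import re


def clean_code(code):
    """Lowercase a code, turn ':' into '_' and drop '.' characters."""
    code = re.sub(r":", "_", code).lower()
    # The backslash makes the period literal; a bare "." matches any character.
    return re.sub(r"\.", "", code)


print(clean_code("US:IBM.N"))  # hypothetical code, not from my data
```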

Weird how they call it “regexr” and not “regexp” or “regexpr”, but whatever.

By the way, dsmnem is Datastream mnemonic.

February 6, 2009

Traversing a directory in stata

Filed under: coding, Open Source, Stata — howardchong @ 12:17 am

I found a nice way to traverse a directory and load all the files in the directory. The key stata commands are to run a directory listing and output the list to a file. Then, you just have to use “levelsof” (or levels) to get your file names.

PROBLEM: Load a bunch of fixed-width TXT files in a directory without having to list all the file names.
SOLUTION:
STATA code

Unfortunately stata has some lame limits on the number of characters in a string (type “help limits”, 244 is the smallest limit), so this will break if you have too many files. You can probably fix this by using a matrix of strings, but I didn’t need to do that.

Another kludgy way of loading A LOT of files would be to skip levelsof: each time, open up the filelist.txt file, do something like “local filename_to_get=v1[`i']”, and loop over i=1 to numfiles.
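
The general pattern (list the directory, then loop over the names) looks like this in Python. To be clear, this is not the Stata code from this post, just a sketch of the same idea; list_txt_files and load_all are hypothetical names, and I'm assuming the files sit directly in one directory and end in .txt.

```python
from pathlib import Path


def list_txt_files(directory):
    """Return the sorted names of all .txt files directly inside directory."""
    return sorted(p.name for p in Path(directory).glob("*.txt") if p.is_file())


def load_all(directory):
    """Read every .txt file in directory into a dict keyed by file name."""
    return {
        name: (Path(directory) / name).read_text()
        for name in list_txt_files(directory)
    }
```

Since the names come back as a list rather than one long string, there is no string-length limit to run into.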

January 8, 2009

file “outreg2_prf.ado” not found

Filed under: coding, Open Source, Stata — howardchong @ 1:03 am

So, I get the above stata error when using outreg2 which I install with “ssc install outreg2”.

This article tells you (1) what I did to trigger the error message and (2) what steps I took to fix it.

UPDATE JUL2009: A good comment below suggests (from stata staff) that  it has to do with disk writes. So, that’s the best answer to date.

(more…)

December 16, 2008

Data cleaning, excel

Filed under: coding, Excel, Open Source, Uncategorized — Tags: , , , — howardchong @ 10:17 pm

Ever have data with commas between the thousand and million marks? Spaces at the end of numbers? Text footnotes appended to numbers? Spaces in numbers?

Cleaning this by hand is a complete pain, so I wrote an Excel Macro.

To install, record a new macro and then stick this in the module.

There’s even a “careful” mode (change the line careful=false to careful=true), which prompts you for each change that I felt might be wrong.

Please write a comment if you find it helpful.

 

(more…)

November 17, 2008

perl script for transposing Stata outreg2 output

Filed under: coding, Data Insights, Stata — Tags: , , — howardchong @ 10:38 pm

I’m using Stata’s outreg2 command and love it. But I run this loop over 600 stocks, and Excel doesn’t allow me to view 600 columns (except in the newer version). The outreg2 file is too wide, so I need to transpose it.

My former post on someone else’s perl script (https://opensourceeconomics.wordpress.com/2008/10/02/perl-script-for-transposing-csv/) actually doesn’t work correctly. I had to make two modifications, and the result is the perl script downloadable from here:

http://are.berkeley.edu/~chong/filesforblog/transpose_tsv_hc.pl

The two modifications are: 1) Files are saved with tabs rather than commas. No big deal, I just changed the split operator. 2) The original script freaked out when there were blanks in the data. Now all blanks are ignored.
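
If you don't have perl handy, the same transposition can be sketched in Python. This is not the script linked above, just an illustration of the idea; transpose_tsv and the sample table are made up, and I'm assuming blank cells should survive as empty cells.

```python
import csv
import io
from itertools import zip_longest


def transpose_tsv(text):
    """Transpose tab-separated text, keeping blank cells as empty strings."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    # zip_longest pads short rows with "" so ragged input stays rectangular.
    columns = zip_longest(*rows, fillvalue="")
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(columns)
    return out.getvalue()


# Hypothetical wide table: one column per stock, one blank cell.
sample = "VARIABLES\tstock1\tstock2\nbeta\t0.9\t\nN\t250\t248\n"
print(transpose_tsv(sample))
```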

November 14, 2008

STATA: Generating a bunch of lagged variables

Filed under: coding, Stata — Tags: , — howardchong @ 10:23 pm

This small blog post is just a note on how to create a bunch of lagged variables using a simple forvalues loop.

describe
* this gives you a list of your variables
foreach varname of varlist qqq-zzz {
* "of varlist" (not "in") makes Stata expand qqq-zzz into every
* variable stored between qqq and zzz, inclusive
  forvalues i=1/9 {
  * generate 9 lagged values for each
     by date, sort: gen lag`i'`varname'=`varname'[_n-`i']
  }
}

So, if you have 10 variables between qqq and zzz inclusive, this script will generate 9 lagged variables for each, 90 in total. Note that because of the “by date” prefix, _n counts observations within each date group.
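
The naming scheme and the shifting are easier to see outside Stata. Here is a minimal Python sketch, assuming each variable is a plain list already in the right order; it ignores the by-group part entirely, and lag and add_lags are hypothetical helper names.

```python
def lag(values, k):
    """Shift values down by k positions, padding the start with None."""
    return ([None] * k + list(values))[: len(values)]


def add_lags(columns, max_lag):
    """Build lag1<name> ... lag<max_lag><name> for every column."""
    lagged = {}
    for name, values in columns.items():
        for k in range(1, max_lag + 1):
            lagged[f"lag{k}{name}"] = lag(values, k)
    return lagged


# Two made-up columns, 9 lags each -> 18 new columns.
data = {"qqq": [1, 2, 3], "zzz": [4, 5, 6]}
lags = add_lags(data, 9)
print(sorted(lags)[:3])
```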

October 2, 2008

stata transpose string variable without xpose

Filed under: coding, Data Insights, Stata — Tags: , , , , — howardchong @ 10:52 pm

So STATA will let you transpose the data with the xpose command, but this does not handle string data.

 

PROBLEM:

I had a set of stock price series. Variable names were date and the stock codes. Rows were days.

 

DATE  - STOCK1 - STOCK2 ... -STOCKN

1/1/2005   $1    $5   $10

...

12/31/2005 ...

 

So, I managed to do it as follows:

1) First, rename all stock variables “price”+STOCKNAME

foreach vn of varlist STOCK1-STOCKN {
  quiet: rename `vn' price`vn'
}

2) reshape long
3) reshape wide

 

reshape long price, i(realdate) j(name) string
drop date
reshape wide price, i(name) j(realdate)

 

Note that I needed realdate to be an integer, so I ran a
gen realdate=date(datestr,"mdy")
and then dropped date.
If I keep date as a string, I can’t have the slashes in it, because the second reshape puts the j values into the new variable names. So you do have to convert it somehow: you can replace the slashes with underscores and then add the “string” argument to the second reshape.
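
The long-then-wide trick is not Stata-specific. Here is a minimal Python sketch of the same two steps, assuming the table is a list of dicts with one dict per day; transpose_table and the sample prices are made up for illustration.

```python
def transpose_table(rows, key):
    """Transpose a table by going long first, then wide.

    rows: list of dicts that share the same keys.
    key:  the column that labels each row (here, the date).
    """
    # Step 1, "reshape long": one (date, stock name, price) triple per cell.
    long_form = [
        (row[key], name, value)
        for row in rows
        for name, value in row.items()
        if name != key
    ]
    # Step 2, "reshape wide": one output row per stock, one column per date.
    wide = {}
    for label, name, value in long_form:
        wide.setdefault(name, {"name": name})[label] = value
    return list(wide.values())


# Hypothetical data; the prices stay strings the whole way through.
table = [
    {"date": "1/1/2005", "STOCK1": "$1", "STOCK2": "$5"},
    {"date": "1/2/2005", "STOCK1": "$2", "STOCK2": "$6"},
]
print(transpose_table(table, "date"))
```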

perl script for transposing csv

Filed under: coding, Data Insights — Tags: , — howardchong @ 12:05 am

 

 

I needed to transpose a 600×5 csv (comma separated values) file so I could read it in Excel 2003.

Found what I needed here: http://biowhat.com/2007/01/14/getting-the-transpose-of-a-csv/

However, I did need to modify the code a bit. See the discussion below.

 

Thanks for the script.

Just a comment though. You have the if condition:
elsif ($AoA[$j][$i] eq ""){
print RESULT "\n";
last;

This ignores all elements past j in $AoA[$j][$i].

That is, if you have any missing values that are coded as blanks, this assumes that everything after the first blank is blank too. I think this is probably good for your dataset (you have streams of observations of different lengths (?))

Since my data has missing observations coded as blanks, I’m gonna remove this elsif condition.

As an example

a csv file with one line:
1, 2, 3, 4, 5, , 7, 8, 9

would be transposed to:
1
2
3
4
5

and the values after the blank would get dropped off.
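
To make the difference concrete, here is a small Python sketch of the two behaviors on the example row above. It is not the perl script itself: both function names are made up, and I'm assuming cells are stripped of surrounding spaces before the blank check.

```python
def column_truncating(cells):
    """Mimic the elsif condition: stop at the first blank cell."""
    out = []
    for cell in cells:
        if cell == "":
            break
        out.append(cell)
    return out


def column_keeping(cells):
    """Keep every cell, so a blank survives as an empty line."""
    return list(cells)


row = [c.strip() for c in "1, 2, 3, 4, 5, , 7, 8, 9".split(",")]
print(column_truncating(row))  # stops after 5
print(column_keeping(row))     # keeps the blank and 7, 8, 9
```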
