Network Analysis of Large Online Datasets

4 years, 8 months ago

In late 2012, I spent an infinitely frustrating week trying to perform network analysis (specifically, building and fitting exponential random graph models with R's ergm package) on part of the Bitcoin blockchain. I learned some things. If you are here because you want to run ergm models on large datasets, you're advised to back away slowly, then run like hell. But here are some notes.

Resources

Coursera classes are great for this stuff! This semester they ran Social Network Analysis, and Introduction to Data Science is coming up in April 2013.

There’s a new book out on data mining, based on a graduate-level CS class on the topic at Stanford: Rajaraman, A., Leskovec, J., & Ullman, J. D. (2012). Mining of Massive Datasets (p. 398). Retrieved from http://infolab.stanford.edu/~ullman/mmds/book.pdf

Skills

Most things can be learned if you’re determined and have a lot of time, but there’s also a point where it’s worth finding a research partner who can short-circuit that learning curve. If you don’t know anything about the following, you may want to find someone who does.

  • A scripting language - commonly Python, also Perl or PHP
  • A database management system (relational or NoSQL; extremely large datasets will likely need the latter), including how to build a database and query it
  • Basic Unix command-line work, particularly if you want to try running things on a cloud computing service such as Amazon EC2

Getting the data

Find out about APIs for the service you’re interested in. Be aware that many APIs exist to allow integration with third-party clients and services rather than to provide data access, and this may limit the methods available.

Think about this early, especially any limitations the APIs might have about what you can get, and how much of it you can get at once (rate limits). You don’t want to be racing the clock to download your data.
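To make the rate-limit point concrete, here is a minimal Python sketch of a download loop that stays under a quota, plus a back-of-envelope estimate of how long a full download would take. Everything named here is an assumption for illustration: fetch_page is a hypothetical function standing in for whatever call your API client provides, and the quota parameters have to come from the service's own documentation.

```python
import time


def fetch_all(fetch_page, max_calls, window_seconds,
              clock=time.monotonic, sleep=time.sleep):
    """Collect every page without exceeding max_calls per window_seconds.

    fetch_page is a hypothetical callable: fetch_page(cursor) returns
    (items, next_cursor), with next_cursor None once nothing is left.
    """
    results = []
    cursor = None
    window_start = clock()
    calls_in_window = 0
    while True:
        now = clock()
        if now - window_start >= window_seconds:
            # A fresh window has begun, so the quota is available again.
            window_start = now
            calls_in_window = 0
        if calls_in_window >= max_calls:
            # Quota used up: wait for the rest of the current window.
            sleep(window_seconds - (now - window_start))
            window_start = clock()
            calls_in_window = 0
        items, cursor = fetch_page(cursor)
        calls_in_window += 1
        results.extend(items)
        if cursor is None:
            return results


def estimated_hours(total_requests, max_calls, window_seconds):
    """Lower bound on download time imposed by the rate limit alone."""
    return (total_requests / max_calls) * window_seconds / 3600
```

Running estimated_hours with your expected request count before you write any other code tells you whether the download is an afternoon or a month.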

ScraperWiki is a nice way to scrape data from the web if you don’t want to have to write too much code, and/or there isn’t an API available to do what you need. They also offer a service that will do your scraping for you for a fee.

R memory limitations

R stores all the data you’re currently working with (i.e. all the objects in your workspace) in RAM. If you have a lot of data - and network data, particularly in matrix representations, can get very large very fast - you’ll need to be aware of roughly how big your workspace objects are.

The dreaded “cannot allocate vector of size” error indicates you’ve run out of addressable memory. How much memory R has available depends on your operating system and architecture. On 32-bit Unix OSs, the limit is 4GB at most and often 3GB, while on 32-bit Windows systems it’s 2GB. In practice, R may complain about a single object that is half the size of the total memory available to it or larger, since each vector needs one contiguous block of memory. On 64-bit Unix OSs, you are in theory limited only by the size of the system memory.

How much memory does your object need?

How much memory you need for a network object depends on the size of the network (how many nodes, how many edges), the data type of the matrix entries (boolean, integer, double), and whether the network can be efficiently stored using a sparse matrix representation. The igraph and network packages store networks in a sparse form by default. For the network package, memory requirements are on the order of the total number of nodes plus the total number of edges; igraph’s data storage may be even more efficient.

Most network analysis matrices are sparse, but snowball samples are a special case. The network package requires that unmeasured edges be recorded as NA (rather than 0, which would mean ‘measured but not present’). In a snowball sample, the ‘outer edge’ of your network consists of nodes that other nodes in the sample have a relationship with. You haven’t measured any edges from these nodes, only edges to them. The code to do this is a little non-intuitive:

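library(network)
# is_outer: logical vector (name chosen here), TRUE for outer-edge nodes
for (i in seq_len(network.size(net3))) {
  if (is_outer[i]) {
    # Row i holds edges sent from node i; these were never measured
    net3[i, , add.edges = TRUE] <- NA
  }
}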

Recording all the required NA edges is pretty devastating for a large network with a lot of unmeasured edges. Estimate your memory requirements before you start to avoid a long-running script that looks like it’s coping until suddenly, 2 days later, it isn’t.
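A rough count of how many edges you will end up storing makes the problem obvious before you start. The sketch below is plain arithmetic in Python and assumes a directed network with no self-loops, where every possible tie sent by an outer-edge node is recorded as NA. The function name and the example figures are made up for illustration.

```python
def snowball_edge_count(n_nodes, n_outer, n_measured_edges):
    """Edges stored once all unmeasured ties from outer nodes are NA."""
    # Each outer node has n_nodes - 1 possible outgoing ties.
    na_edges = n_outer * (n_nodes - 1)
    return n_measured_edges + na_edges


# Hypothetical sample: 100,000 nodes, 60,000 of them on the outer edge,
# and 300,000 edges actually measured.
total = snowball_edge_count(100_000, 60_000, 300_000)
print(f"{total:,} edges to store")  # 6,000,240,000 edges to store
```

With those made-up numbers the NA edges outnumber the measured ones by a factor of roughly 20,000, so the "nodes plus edges" storage cost is dominated by what you did not measure.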

Another point to note: any computation on the adjacency matrix representation of your data no longer uses a sparse data structure, and your memory requirements increase accordingly. An n-by-n integer matrix requires n*n*4 bytes, and a matrix of doubles requires twice that.
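The dense-matrix arithmetic is worth running for your own node count. A small Python sketch, using the 4 bytes per integer entry given above and 8 bytes per double; the function names and the 50,000-node example are illustrative only:

```python
BYTES_PER_ENTRY = {"integer": 4, "double": 8}


def dense_matrix_bytes(n_nodes, entry_type="integer"):
    """Bytes needed for a full n-by-n adjacency matrix."""
    return n_nodes * n_nodes * BYTES_PER_ENTRY[entry_type]


def to_gib(n_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


# A hypothetical 50,000-node network stored as a dense integer matrix.
print(round(to_gib(dense_matrix_bytes(50_000)), 2))  # 9.31
```

That is over 9GB for a single copy of the matrix, before R makes any temporary copies during computation.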

Handling memory limitations

The most commonly recommended way to handle memory limitations in R is not to have all your data in memory at once. Use a database to store the full dataset, and query it to get just the ‘slice’ (a subset of cases or a subset of variables) you need to analyze. There are various connector libraries for R that allow you to query data stored in SQL or NoSQL databases, or you can just output the result of the query to a .csv file for import into R.
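As one way of doing the "query, then export a .csv" route, here is a minimal Python sketch using the standard library's sqlite3 and csv modules. The table layout is an assumption made up for the example: a table called edges with sender, receiver and block columns. Substitute your own schema and database driver.

```python
import csv
import sqlite3


def export_slice(db_path, csv_path, min_block, max_block):
    """Write the edges from one range of blocks to a CSV file.

    Assumes a hypothetical table edges(sender, receiver, block).
    Returns the number of data rows written.
    """
    query = (
        "SELECT sender, receiver FROM edges "
        "WHERE block BETWEEN ? AND ? ORDER BY block"
    )
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(query, (min_block, max_block))
        with open(csv_path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["sender", "receiver"])
            n_rows = 0
            # Rows are streamed one at a time, never held in memory at once.
            for row in rows:
                writer.writerow(row)
                n_rows += 1
    finally:
        conn.close()
    return n_rows
```

The resulting file can be loaded in R with read.csv, so only the slice, not the full dataset, ever lands in your workspace.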

If this isn’t feasible, cloud computing services such as Amazon EC2 let you access more memory relatively cheaply. EC2 instance specifications are here, with pricing here. Choose a 64-bit Unix image to start, log in over ssh and install R, then copy your script to the server to run it. At $1.80 per hour (at the time of writing) for the double-extra-large 30GB instance, it’s worth a shot! Actually, in this range there are likely other issues limiting the amount of memory a single object can occupy, but some of the smaller instances allow you to run your analysis with double or triple the memory typically available on a 32-bit PC. More information on running R on EC2 here, here and here.

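Worth knowing: the command to run a script that keeps running after you log out of the terminal is below. When R reads a script from standard input it insists on one of --save, --no-save or --vanilla, so pass one explicitly:
nohup R --no-save < infile.R > outfile.txt &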
This is important even if you don’t plan to log out, as R’s propensity to 100% CPU usage if left unchecked can do weird things to your ssh session.
