<h1>Blog | doorinthewall.co.za</h1>
<h2>LoRa + GPS trackers for Burning Man (2017-05-13)</h2>
<p>A pair of homebrew trackers for off-the-grid location sharing, using Adafruit's Feather boards. Hardware mostly me, software mostly <a href="http://mrgris.com/">Drew</a>. Built (in a rush) for Burning Man 2016, where the perfectly circular city layout makes it possible to accurately convert GPS co-ordinates to a street address (e.g. 4.15 &amp; J-25m). For Afrikaburn, which does not use a perfectly circular layout due to terrain, we used a radial system instead: we chose a centre point and an origin bearing, and displayed each fix as a clock-time bearing plus a distance in metres (e.g. 9.45 + 600m).</p>
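<p>The radial conversion is simple enough to sketch in Python. This is illustrative only: the centre coordinates and origin bearing below are made up, and the real trackers do the conversion on the Feather itself, in the Arduino sketch linked under Code.</p>

```python
import math

# Made-up centre point and origin bearing -- the real values depend on
# the event layout and are not the ones we used.
CENTRE_LAT, CENTRE_LON = -32.327, 19.748
ORIGIN_BEARING_DEG = 0.0
EARTH_RADIUS_M = 6371000.0

def radial_address(lat, lon):
    """Convert a GPS fix to a 'clock time + distance' address relative to
    the centre, using an equirectangular approximation (fine at event scale)."""
    dlat = math.radians(lat - CENTRE_LAT)
    dlon = math.radians(lon - CENTRE_LON) * math.cos(math.radians(CENTRE_LAT))
    dist_m = EARTH_RADIUS_M * math.hypot(dlat, dlon)
    bearing = (math.degrees(math.atan2(dlon, dlat)) - ORIGIN_BEARING_DEG) % 360.0
    # Map 360 degrees onto a 12-hour clock face: 30 degrees per 'hour'.
    total_min = round(bearing / 30.0 * 60) % 720
    h, m = divmod(total_min, 60)
    return "%d.%02d + %dm" % (h or 12, m, round(dist_m))
```

<p>A fix 600 m due north of the centre (with origin bearing 0) comes out as "12.00 + 600m".</p>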
<h3>Parts list</h3>
<ul>
<li><a href="https://www.adafruit.com/product/3078">Adafruit Feather 32u4 RFM95 LoRa Radio - 868 or 915 MHz</a></li>
<li><a href="https://www.adafruit.com/product/3133">Adafruit Ultimate GPS FeatherWing</a></li>
<li><a href="https://www.adafruit.com/product/2900">FeatherWing OLED</a></li>
<li><a href="https://www.adafruit.com/product/3340">LoRa Antenna Kit</a> + <a href="/admin/blog/blogpost/18/uFL%20SMT%20Antenna%20Connector">uFL SMT connector</a> (important! not included!)</li>
<li><a href="https://www.adafruit.com/category/44_138">Lithium Ion Battery Pack - 3.7V 4400mAh</a></li>
<li>Small hard plastic case (Muji)</li>
</ul>
<h3>Code</h3>
<ul>
<li><a href="https://github.com/mloudon/electrosew/blob/master/feather_send/feather_send.ino">https://github.com/mloudon/electrosew/blob/master/feather_send/feather_send.ino</a></li>
</ul>
<h3>Notes</h3>
<ul>
<li>Batteries lasted a full week, with several hours of use each night, although we did tend to keep them switched off when both trackers were together.</li>
<li>One of the things we plan to do at Burning Man this year is a proper range test. The first version used a wire antenna, and the range (in the city) seemed around 800m. The second version, with the external antenna, received a fix around 20km away! The reliability requirement for these is quite low - one fix every few minutes is probably fine - and at this level the range covered the entire Afrikaburn site.</li>
<li>The 32u4 Feather boards have 32K of flash, which even with as much optimizing as we could do in limited time wasn't enough to use Adafruit's recommended GPS and packet radio libraries together. We got around this by using <a href="https://github.com/mikalhart/TinyGPS">TinyGPS</a> instead. The M0 Feathers have 256K and are currently the same price, so they are almost definitely the better choice.</li>
<li>In testing, I found that switching from send to receive took the hardware a little time. If your radios aren't talking to each other, try adding a short delay.</li>
<li>For some reason, the OLED doesn't come on on startup - you need to press the reset button once first. Shrug.</li>
</ul>
<h2>Heartbeat Choker (2015-10-08)</h2>
<p>A long-neglected Arduino project using <a href="https://www.adafruit.com/products/659">Adafruit's Flora</a> + sewable NeoPixels, and the <a href="http://pulsesensor.com/products/pulse-sensor-amped">Pulse Sensor Amped</a>.</p>
<p><iframe allowfullscreen="" height="315" src="https://www.youtube.com/embed/eRBzx5fW7NE?rel=0&controls=0&showinfo=0&autoplay=1" width="420"></iframe></p>
<p>The only thing I really had to work out in the code (sample <a href="https://github.com/WorldFamousElectronics">here</a>) was the timer settings to use - against everyone's good advice, only Timer0 seemed to work reliably on the Flora.</p>
<p><img height="500" src="/static/media/uploads/img_4630_small.jpg.jpg" width="400"></p>
<p>The biggest part of this project was actually figuring out a comfortable way to carry the battery on my body, without wires getting in the way or any weight being carried by the choker itself. Pictures below of the result, a somewhat elaborate harness made from elasticated ribbon and o-rings. For a bigger battery, plain elastic would work too - the main thing is that the battery is held in place by multiple straps, so it moves with the body and doesn't swing like a backpack would.</p>
<p><img height="522" src="/static/media/uploads/img_4628_small.JPG" width="400"></p>
<p>Testing pic. Not pictured: various testing activities attempting to raise my heart rate for display.</p>
<h2>2014 elections party results maps (2014-05-14)</h2>
<p>More fun with IEC elections results data (and maps). The images below show party results (% of votes) by district for a couple of parties. Made using QGIS - I might get a better map up with TileMill soon.</p>
<p>Results data in <a href="/static/media/uploads/2014_election_results_by_voting_district.zip">this zipfile</a>, spatial data in <a href="https://drive.google.com/file/d/0Bz2bLQUbSHYfMWRIdVY2bjQ1MFE/edit?usp=sharing">this shapefile</a>. The IEC should eventually release an official results dataset; this is not it.</p>
<p>Click the image for the larger version.</p>
<h3>ANC</h3>
<p><a href="/static/media/uploads/anc.png" target="_blank"><img height="261" src="/static/media/uploads/anc_small.png" width="400"></a></p>
<h3>DA</h3>
<p><a href="/static/media/uploads/da.png" target="_blank"><img height="261" src="/static/media/uploads/da_small.png" width="400"></a></p>
<h3>EFF</h3>
<p><a href="/static/media/uploads/eff.png" target="_blank"><img height="261" src="/static/media/uploads/eff_small.png" width="400"></a></p>
<h3>AGANG SA</h3>
<p><a href="/static/media/uploads/agang.png" target="_blank"><img height="261" src="/static/media/uploads/agang_small.png" width="400"></a></p>
<h3>IFP</h3>
<p><a href="/static/media/uploads/ifp.png" target="_blank"><img height="261" src="/static/media/uploads/ifp_small.png" width="400"></a></p>
<h3>NFP</h3>
<p><a href="/static/media/uploads/nfp.png" target="_blank"><img height="261" src="/static/media/uploads/nfp_small.png" width="400"></a></p>
<p>I'm not pretending to be able to make a coherent argument with these, but in general, there seems to be nothing too surprising here. The DA is strong in the Western Cape and Gauteng, the ANC in the eastern half of the country. The IFP and NFP fight it out over very little territory. AGANG SA did poorly almost everywhere, with a bit of support in the Western Cape and Gauteng and along the Garden Route. The EFF did seem to do relatively better in the mining areas of Rustenburg and Thabazimbi, i.e. the area around the Marikana mine.</p>
<h2>Voter Turnout in the 2014 South African Elections (2014-05-13)</h2>
<p>Earlier this year, the IEC released some mobile apps, and an API with full access to election results as they become available. The API has some problems (notably, it requires a username and password which you have to *cough* get from somewhere) but it feels like a step in the right direction. At the moment there isn't any way to get results for all voting districts - rather, they have to be requested per district. I used <a href="https://github.com/Code4SA/IEC-API-reader">Code4SA's 'IEC API API'</a> and a list of voting districts found in this SQL dump to pull down the results, and I'm sharing the data in case anyone is interested.</p>
<p>This <a href="/static/media/uploads/2014_election_results_by_voting_district.zip">zip file</a> contains csv files of results by party (number of votes and percentage of total) for each voting district, as well as data on voter turnout, special votes, and total registered voters.</p>
<p>This <a href="http://t.co/3H3JJ5dW9S">shapefile</a> contains the voting districts. I converted it from the SQL dump; use at your own risk.</p>
<p>Getting the data took a while and involved some acrobatics, which I'll post about when I get a chance. I made a <a href="https://api.mapbox.com/styles/v1/mloudon/cj0phns0c00922smr9i591op6.html?title=true&access_token=pk.eyJ1IjoibWxvdWRvbiIsImEiOiJSa2VfRTYwIn0._0F2bENcIe8zz8HALLMMug#5.1/-28.908910/24.975229/0" target="_blank">voter turnout map</a> as a quick attempt to do something with it, and I think it turned out pretty well! <a href="https://www.mapbox.com/">Mapbox</a> is awesome.</p>
<h2>Reddit and the SOPA blackouts: Exploring the problem space with Wikipedia links (2014-04-23)</h2>
<p>I've had a dataset kicking around for a while on Reddit's resistance to SOPA: 766 Reddit posts and associated comments, obtained through Reddit's API, covering from late 2011 until the day of the blackouts, 18 January 2012. I've tried a couple of ways to analyse it, mostly thinking from a social movement perspective about how a coherent and ultimately successful resistance emerged.</p>
<p>The first thing I did was read them all, and try out some theory on mobilisation in cases of scientific/technical controversy. The <a href="http://journals.uic.edu/ojs/index.php/fm/article/view/4365/3824#p3">resulting paper</a> came out in First Monday a couple of weeks ago. I sorted the data to look at the top posts per subreddit, top self-posts, and commonly linked sites and domains, which helped to figure out what was important, but it started out as a term paper and I didn't really have time for more quantitative analysis.</p>
<p>During that summer, I proposed a project using named entity recognition on the comments to obtain people and companies that Reddit users singled out as part of various resistance strategies. For example, there was a boycott of GoDaddy, as well as other companies seen to be supporting SOPA, and various congressional representatives were discussed either as potential allies against the bill, or supporters of it who were suitable targets of ridicule for their evident technological ineptness.</p>
<p>This was partly a ploy to be allowed to take the fantastic <a href="https://www.coursera.org/course/nlp">Stanford online Natural Language Processing class</a> for credit. Fortunately it worked, because the named entity recognition (using nltk and only just enough knowledge to be dangerous) certainly didn't. Reddit comments, like tweets, lack context information (often even simple things like capitalization are missing), have many different kinds of named entities, and have relatively few named entities in general.</p>
<p>I tried a few other things here and there, mostly as a temporary dissertation avoidance strategy. A little more python allowed me to extract all urls in comments, which I dumped into R to try some network analysis on. Hyperlink network analysis can sometimes produce interesting results, but it doesn't work as well when a lot of the links in your network are to media hosting sites (like youtube) or posts on social networks (facebook, twitter, reddit). In a hyperlink network created by a crawler such as <a href="https://www.issuecrawler.net/">IssueCrawler</a>, domains are assumed to have some meaning, through which there is meaning in how they link to each other. Unfortunately, my dataset (and probably most hyperlink networks these days) was dominated by 'platform' sites that host extremely diverse content. The top 10 domains in the dataset include Wikipedia, Reddit, YouTube, imgur, Twitter and Facebook. Tech news muckraking site Techdirt does make an appearance, as do the Sunlight Foundation's government transparency site, opencongress.org, and thomas.loc.gov, the Library of Congress site for legislative information. This gives a bit of a sense of what kinds of information people were linking to, but doesn't inspire confidence in the ability of a scraping operation to meaningfully connect domains.</p>
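<p>The extraction step looked roughly like this (a sketch, not the actual code; the regex and function names are illustrative):</p>

```python
import re
from collections import Counter
from urllib.parse import urlparse

# A deliberately simple URL pattern -- good enough for a sketch,
# not a fully RFC-compliant matcher.
URL_RE = re.compile(r"https?://[^\s)\"'<>]+")

def extract_domains(comment):
    """Pull every URL out of a comment and reduce it to its domain,
    stripping a leading 'www.'."""
    domains = []
    for url in URL_RE.findall(comment):
        host = urlparse(url).netloc.lower()
        domains.append(host[4:] if host.startswith("www.") else host)
    return domains

def top_domains(comments, n=10):
    """Rank domains by how many links in the comments point to them."""
    counts = Counter(d for c in comments for d in extract_domains(c))
    return counts.most_common(n)
```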
<p>Instead of scraping hyperlinks to connect domains, you could also create a network by connecting domains that are mentioned in the same comment thread, or in comments in the same subreddit. Predictably, the first is extremely sparse, and the second is dominated by the same platforms that appear frequently across domains. For completeness, here's a quick visualization made using the sna and network packages in R. Way over there on the left is Canada, specifically the Parliament of Canada, and the Canadian Coalition for Electronic Rights.</p>
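<p>Connecting domains mentioned in the same thread reduces to counting pairs. Here is a sketch in Python (my actual analysis used the sna and network packages in R; the function name and input shape are illustrative):</p>

```python
from collections import Counter
from itertools import combinations

def comention_edges(threads):
    """Build a weighted edge list: two domains are connected when they are
    linked from the same comment thread, weighted by co-occurrence count.

    `threads` maps a thread id to the set of domains linked in it.
    """
    edges = Counter()
    for domains in threads.values():
        # sorted() makes each pair a canonical, hashable edge key
        for pair in combinations(sorted(domains), 2):
            edges[pair] += 1
    return edges
```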
<p><img height="184" src="/static/media/uploads/rplot.png" width="514"></p>
<p>A couple of weeks later, I wondered whether you could classify the kinds of evidence (and the kinds of targets) being deployed by Reddit users in comment threads. For example, government figures - specifically members of Congress - were clearly seen as more fruitful targets for influence than anyone in the media industries, and actions were oriented toward influencing the formal political process. I drew up 10 categories - among them government sites, non-profits, media or tech companies, and mainstream and specialist (usually legal or tech) news sites - and asked workers on Mechanical Turk to classify each domain.</p>
<p>This was a complete failure, and (previous bad experiences with Mechanical Turk having led me to do a test run first with a very small sample) a waste of $4. Mechanical Turk works best when it's easier to choose the right answer than the wrong one. Looking at the error pattern, many of my workers chose to completely disregard the instruction to visit the site, instead guessing based on the top-level domain. I considered looking elsewhere and paying more for better microwork, but concluded the potential findings weren't interesting enough to warrant it.</p>
<p>FINALLY, now quite keen to put this project to bed (and having managed to get my original paper published without additional quantitative work), I hit on the idea of using the Wikipedia links to characterize the problem space. This is actually a pretty common approach. Because Wikipedia <em>not only explains concepts but links and categorizes them</em>, it has been used to calculate term similarity, expand queries to suggest related results, and otherwise help computers process and use unstructured texts. <a href="http://arxiv.org/pdf/0809.4530.pdf">This paper</a> has an exhaustive review of work using Wikipedia, mostly in computer science.</p>
<p>In the end, I didn't pursue links between Wikipedia entries, although this would have been a fun project. Instead, I created two word clouds, one of frequently linked-to topics, and one of their categories. Category information was obtained from the <a href="http://www.mediawiki.org/wiki/API:Main_page">Wikipedia API</a> using <a href="https://github.com/alexz-enwp/wikitools">wikitools</a> (more python). Word clouds were created with the impressive d3-based <a href="https://github.com/jasondavies/d3-cloud">d3-cloud</a> javascript library. Word clouds are a bit tricky to use for multi-word topics, but I think they came out ok. Click <a href="/static/media/uploads/categories.png">here</a> for the large version of the categories cloud, <a href="/static/media/uploads/pages.png">here</a> for the large version of the pages cloud.</p>
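<p>The category-gathering step can be sketched against the raw MediaWiki API using only the standard library (the original used wikitools; the function names and cutoff handling here are illustrative):</p>

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

API = "https://en.wikipedia.org/w/api.php"

def fetch_categories(title):
    """Fetch the (non-hidden) categories of one Wikipedia page via the
    MediaWiki API. Plain-stdlib stand-in for the wikitools call."""
    params = urllib.parse.urlencode({
        "action": "query", "prop": "categories", "clshow": "!hidden",
        "cllimit": "max", "titles": title, "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return [c["title"].removeprefix("Category:") for c in page.get("categories", [])]

def category_counts(pages_to_cats, cutoff=5):
    """Count categories across all linked pages, keeping those that meet
    the word-cloud cutoff."""
    counts = Counter(cat for cats in pages_to_cats.values() for cat in cats)
    return {cat: n for cat, n in counts.items() if n >= cutoff}
```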
<h3>Most frequently linked Wikipedia categories</h3>
<p>(<a href="/static/media/uploads/categories.png">larger version</a>)</p>
<p>This word cloud shows all categories appearing 5 or more times in the dataset. The cutoff was chosen simply to be able to produce a reasonably legible word cloud. Six obvious umbrella categories - Living Persons, 2011 in the United States, 2012 in the United States, 1998 in the United States, 1947 births and 1945 births - were removed.</p>
<p><a href="/static/media/uploads/categories.png"><img height="350" src="/static/media/uploads/categories_scale.png" width="640"></a></p>
<h3>Most frequently linked Wikipedia topics</h3>
<p>(<a href="/static/media/uploads/pages.png">larger version</a>)</p>
<p>This word cloud shows all topics appearing 3 or more times in the dataset. Again, the cutoff is arbitrary - word clouds work best with a small number of words.</p>
<p><a href="/static/media/uploads/pages.png"><img height="424" src="/static/media/uploads/pages_scale.png" width="640"></a></p>
<h2>Open Access to ICTD Research (2013-12-12)</h2>
<p><strong>This post is aimed at authors whose work appears in the proceedings of ICTD 2013 in Cape Town. The proceedings aren't open access, but YOU can break through the paywall!</strong></p>
<p>During Tuesday's open session, titled "<a href="http://ticsparadesarrollo.wordpress.com/2013/12/03/open-session-ictd2013/">Appropriating ICTs for Developing Critical Consciousness and Structural Social Change</a>" or <a href="https://twitter.com/search?q=%23punkICT4D&src=hash">#punkICT4D</a>, we discussed ways to make research products more accessible to people with no or limited access to journal subscriptions through university libraries. This post is to remind authors of papers or notes for ICTD 2013 that the <a href="http://www.acm.org/publications/policies/copyright_policy/">ACM allows authors</a> to either:</p>
<ul>
<li>Post a copy of the accepted version on a personal webpage or in an institutional repository</li>
<li>Get a link that enables free downloads of your work (via <a href="http://www.acm.org/publications/acm-author-izer-service">ACM Author-Izer</a>).</li>
</ul>
<p>For more ways to share your research, see <a href="http://phylogenomics.blogspot.it/2013/01/ten-simple-ways-to-share-pdfs-of-your.html">this post</a>. It was written as part of a tribute to open access activist Aaron Swartz, in which authors posted PDFs of their work with the hashtag #PDFtribute. They're collected here. Aaron's <a href="http://archive.org/stream/GuerillaOpenAccessManifesto/Goamjuly2008_djvu.txt">Guerilla Open Access Manifesto</a> is quoted below.</p>
<blockquote>
<p>Forcing academics to pay money to read the work of their colleagues? Scanning entire libraries but only allowing the folks at Google to read them? Providing scientific articles to those at elite universities in the First World, but not to children in the Global South? It's outrageous and unacceptable.</p>
<p><br>"I agree," many say, "but what can we do? The companies hold the copyrights, they make enormous amounts of money by charging for access, and it's perfectly legal — there's nothing we can do to stop them." But there is something we can do, something that's already being done: we can fight back.</p>
<p><br>Those with access to these resources — students, librarians, scientists — you have been given a privilege. You get to feed at this banquet of knowledge while the rest of the world is locked out. But you need not — indeed, morally, you cannot — keep this privilege for yourselves. You have a duty to share it with the world. And you have: trading passwords with colleagues, filling download requests for friends.</p>
<p><br> Meanwhile, those who have been locked out are not standing idly by. You have been sneaking through holes and climbing over fences, liberating the information locked up by the publishers and sharing them with your friends. <br>But all of this action goes on in the dark, hidden underground. It's called stealing or piracy, as if sharing a wealth of knowledge were the moral equivalent of plundering a ship and murdering its crew. But sharing isn't immoral — it's a moral imperative. Only those blinded by greed would refuse to let a friend make a copy.</p>
</blockquote>
<h2>ICTD 2013 Short Paper: Mobiles and Migration (2013-12-05)</h2>
<p>I've got a short paper in <a href="http://www.ictd2013.info/">ICTD 2013</a>. The title is "Mobiles and Migration: Global Data on Immigrant Population and Mobile Subscriptions" and the abstract is below. You can also download the <a href="/static/media/uploads/mobiles_and_migration_paper_data.zip">data</a>, <a href="/static/media/uploads/mobile_and_migration_paper_code.r">R code</a>, <a href="/static/media/uploads/final_mobiles_and_migration_paper.pdf">draft paper</a> and the <a href="/static/media/uploads/chart_large.png">bubble chart</a> I used on my poster.</p>
<h3><em>Abstract</em></h3>
<p><em>Expanding on recent findings on mobile phones as enablers of domestic labor mobility, this paper considers the relationship between immigrant population and mobile cellular subscriptions. Regression analysis shows that the immigrant proportion of a country's population significantly predicts mobile cellular subscriptions per 100 people, controlling for GDP (p &lt; .01). Further, the model explains 24% of the variance in the dependent variable, compared to only 15% for GDP alone.</em></p>
<h2>Let them eat cellphones? (2013-11-12)</h2>
<p dir="ltr">On Monday, the New York Times published <a href="http://www.nytimes.com/2013/11/08/giving/ubiquitous-across-globe-cellphones-have-become-tool-for-doing-good.html?_r=1">Ubiquitous Across Globe, Cellphones Have Become Tool for Doing Good</a>. That a technology could be a tool for doing good is a fairly modest claim, and has been made about probably every new communication technology since the printing press. (For more basic claims about new communication technologies and/or a game of ‘seen in the NYT’ bingo, see this <a href="http://xkcd.com/1289/">handy cheat sheet</a> that appeared in the webcomic <a href="http://xkcd.com">xkcd</a> on the same day. Certainly, Snapchat et al demonstrate that yes, teens will indeed use &lt;insert new communications technology here&gt; for sex.) Technical solutions to problems of poverty and underdevelopment are particularly popular, not least because they require exactly no soul-searching on the part of the West.</p>
<p dir="ltr">Given the low bar thus established, one would expect the claim that cellphones have been used for doing good to be quite easy to support. Indeed the article does just that, with many pleasingly uplifting examples. Healthcare workers use cheap featurephones to monitor mother and child health and track polio vaccination campaigns. Indian mobile users can subscribe to daily informational SMSs on a range of health issues. Residents of Kibera, a dense informal settlement outside Nairobi, can find water sellers or advertise water for sale on their phone, while in Kenya and South Africa, a credit score can be procured by providing detailed daily expenditure and income data by SMS.</p>
<p dir="ltr">In each example, cellphone apps and services are meeting a clearly defined need. Better healthcare, clean water and access to credit are unquestionably important. Furthermore, cellphones are widespread, and we have quite a bit of evidence that phone ownership and use can measurably improve the lives of the poor. But the vision of development exemplified by cellphone apps and services that are “Western-inspired but designed for people making $2 a day” bears examining. After all, shouldn’t we be more concerned that the residents of Kibera, many unemployed, caring for young children and in precarious health themselves, are forced to pay for clean water?</p>
<p dir="ltr">At this point, the emperor’s clothes begin to come off. First, the statistic that 96% of the world “is connected via cellphone” is grossly misleading. While the total number of mobile subscriptions may be equal to 96% of the world’s total population, a mobile subscription is in no way a proxy for a “connected” individual. Rather, every device that connects to a mobile network (phones, tablets, mobile Internet devices) counts as one subscription, and users with a subscription to more than one network - extremely common in prepaid markets, where in-network calling is cheaper - are counted twice. The ITU arrives at the 96% statistic by averaging penetration rates that <a href="http://www.itu.int/en/ITU-D/Statistics/Documents/facts/ICTFactsFigures2013.pdf">vary from 170% in the former Soviet Union to 63% in Africa</a>. Aid agencies with a charitable purpose, like USAID, might consider focusing on those who aren’t connected rather than assuming hyperconnectivity averages things out, which is a bit like saying that between gluttony in the US and starvation in Ethiopia, everyone has about enough to eat.</p>
<p dir="ltr">Of course, not all the examples rely on individual ownership of mobile devices. m-Health projects in particular often provide cellphones to health workers to extend the reach of e-Health initiatives, which have been going since the mid-1990s in some developing countries. There are some promising results for these kinds of projects, but most relate to pilots or small-scale implementations. <a href="http://www.who.int/bulletin/volumes/90/5/11-099069/en/">Evidence of impacts from country-level, interoperable e-Health systems in the developing world remains elusive in part because such research is extremely expensive, and cost-benefit studies are even harder to come by</a>. Cellphones can help. They are cheap, easier to use than a computer for many people, and increasingly computationally advanced. At the same time, few in m-health would argue that they are a transformative technology, simply because health impacts depend on so many other things within and beyond the health system.</p>
<p dir="ltr">Those mobile services that do reach poor individuals are still quite far from the kinds of apps that Nandu Madhava, who is quoted commenting that Android smartphones are “very affordable for the poor”, is probably imagining. They require the user to request, receive or publish information using numeric codes (Unstructured Supplementary Service Data - USSD), or in 160-character (70 in languages with non-Latin scripts) short text messages (SMS). Such systems work surprisingly well given that SMS was never designed for composition on a mobile phone. Rather, it was intended as a paging system, with mobile users receiving the messages from their network. In part because of language problems, SMS use is variable among the poor. A <a href="http://lirneasia.net/wp-content/uploads/2010/07/Texting-among-the-Bottom-of-the-Pyramid-Facilitators-and-Barriers-to-SMS-Use-among-the-Low-income-Mobile-Users-in-Asia.pdf">representative survey conducted in Bangladesh, Pakistan, India, Sri Lanka and Thailand</a> found that 32.2% of mobile owners in socioeconomic groups defined as the “bottom of the pyramid” had ever used SMS, and they tended to be younger, more urban and more educated than non-users. Furthermore, a <a href="http://itidjournal.org/index.php/itid/article/view/1087/444">recent study</a> on a sexual health information SMS service delivered by Google, the Grameen Foundation and MTN in Uganda cautions against conflating better information with health behaviour change. The study found no change in norms or behaviour around risky sex, in part because the SMS was not seen as a trusted information source in the same way advice from a health worker might have been.</p>
<p dir="ltr">The quote by Hilmi Quraishi, who describes the choice of mobile apps and services as “local technology to be self-reliant”, is the article’s final irony. With the exception of ZMQ (Indian), all the examples cited are either developed or funded from the US. Including Kenya’s M-Pesa, a mobile money transfer system with 11 million users, or 25% of the country’s population, might have balanced things out a bit. Or, perhaps, asking whether all those smartphones have really done much to save America’s public school system or deliver universal healthcare.</p>
<p dir="ltr">The truth, as Indian tech and social change commentator Anita Gurumurthy <a href="http://itidjournal.org/itid/article/viewFile/624/264">points out</a>, is that “network capitalism is embedded within the technological DNA of mobile telephony”. Around the world, mobile network operators - multinationals with huge profits - have fought tooth and nail to avoid regulation that would make mobile communications more affordable for the poor. They are even less keen to let third party services operate on their networks, unless they get a cut. Some, as <a href="http://www.theguardian.com/business/2011/jul/26/vodafone-access-egypt-shutdown">Vodafone did in Egypt when it sent pro-Mubarak messages and then shut the network down</a>, cheerily accede to the demands of repressive regimes, while many more (hello, Verizon) go beyond the call of duty in spying on their customers.</p>
<p>The apps and services discussed have been developed in response to real problems by well-intentioned, highly skilled and innovative companies. It is fair to say they are doing something good. However, writing laudatory trend pieces about “Western-inspired” mobile apps and services in the developing world is like being proud of throwing the poor some table scraps. Cellphones may be used to do good, but do they really do enough?</p>
<p><em>* <a href="http://www.mrgris.com">Drew</a> obviously works for Dimagi, an m-health company mentioned in the article. Our marriage thrives on healthy disagreement.</em></p>
<h2>Deleting media files on Gondor (2013-08-29)</h2>
<p>The <a href="http://www.criticalmediaproject.org/">critical media project</a> is hosted on <a href="https://gondor.io/">Gondor</a>. It's generally been pretty smooth, but the documentation is a bit ad-hoc in places. I think they've been feeling out the PaaS thing at the same time as I have.</p>
<p>I needed to delete a file from the writeable storage. Gondor's <a href="https://gondor.io/support/storage/">options for interacting with storage</a> are rsync, scp or ls over ssh, but you can't access it via ssh directly. The solution I used involved creating a dummy empty directory locally, then syncing it with the writeable storage dir with only the target file included. Thanks, <a href="http://stackoverflow.com/a/7793471">stackoverflow</a>.</p>
<p>First, upload your ssh key to Gondor. Then, using your instance ID - found in the url of the instance detail page - do:</p>
<pre>ssh <instance_id>@ssh.gondor.io ls <parent_dir><br><br>mkdir mock<br><br>rsync -rv --delete --include="<filename>" '--exclude=*' mock/ <instance_id@ssh.gondor.io:<parent_dir><br><br>ssh <instance_id>@ssh.gondor.io ls <parent_dir>
</pre>Are technology innovations "changing Africa"?2013-08-23T01:13:42+00:00admin/blog/author/admin/http://doorinthewall.co.za/blog/is-africa-a-hotbed-of-technology-and-innovation/<p>In response to this <a href="https://www.facebook.com/photo.php?fbid=605809072793027&set=a.224729904234281.63247.156831441024128&type=1">post on facebook</a>, original source <a href="http://www.portland-communications.com/publications/the-quarterly-africa/ten-technologies-changing-africa/">here</a>.</p>
<p>I want this to be true, and in some ways it is. But in others, it really, really isn't. m-Health with a limping health system (m-health impact and cost-benefit analyses in general)? mobile money in places that are not Kenya, Nigeria, and a few others? Ushahidi's pet project and the OLPC as local innovations? That mxit statistic, which should be more like 6.5 million active users? None of these things are without merit, but they don't add up to "innovations changing Africa" except in quite a limited sense of change.</p>
<p>Replacing <a href="http://www.wrongingrights.com/category/africa-land-of-rape-and-lions">Africa, land of rape and lions</a> with Africa, land of technology! and innovation! is awkwardly simplistic - and, depending on how you feel about aid spent on technology and not on other things, also materially consequential. Africa, very big and understandably pretty complicated?</p>
<p>Also, that map of innovations 'taking over' the geographical extent of Africa is.. not a good visual.</p>Latitude Public Location Widget2013-06-05T20:59:01+00:00admin/blog/author/admin/http://doorinthewall.co.za/blog/latitude-public-location-api-templatetag/<p><em>update nov 2013: uh, well, never mind. latitude is gone for good. the track your friends part of things has moved to g+, but a) no and b) there's no API anyway. i ended up rolling my own replacement using <a href="https://code.google.com/p/little-fluffy-location-library/">little fluffy location library</a> for Android</em></p>
<p>On the homepage of this site is a Google Latitude badge showing where I am (or, if my phone isn't on or updating things, where I was some time ago). Getting your location (as well as location history data and other things) from the <a href="https://developers.google.com/latitude/v1/getting_started">Google Latitude API</a> requires oauth, and being as this was meant to be a quick afternoon project, this wasn't something I wanted to get into. Fortunately, the <a href="https://latitude.google.com/latitude/b/0/apps">Google Latitude Public Location Badge</a>, once enabled for either accurate or city-level location, can be used to get a location, its reverse-geocoded place name, and a timestamp.</p>
<p>The usual way to use this is to embed the "badge" html, which looks ok, but I wanted to control the way my badge looked and the way the timestamp was displayed. Fortunately, all the data in the badge is also available as json. I wrote a <a href="https://docs.djangoproject.com/en/dev/howto/custom-template-tags/">Django templatetag</a> that requests the json and returns the attributes (lat/lon, placename, timestamp) used in the badge. I then use this to get a map from the Google Maps static API. You can see the templatetag code <a href="https://bitbucket.org/mloudon/portfoliosite/src/e358b2199adf2aa2fb7ff0c887edf307b1c2e9dd/latitudetag/templatetags/latitude_tags.py?at=master">here</a>.</p>
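<p>For the curious, the parsing side of the templatetag is simple. A minimal Python sketch - the feed was a GeoJSON-style FeatureCollection, but the property names (<code>reverseGeocode</code>, <code>timeStamp</code>) are from memory and should be treated as assumptions, not a spec:</p>

```python
import json

def parse_badge(raw):
    """Pull lat/lon, placename and timestamp out of the badge JSON.
    The structure (a GeoJSON-style FeatureCollection) and the property
    names here are illustrative - check the actual feed you get back."""
    feature = json.loads(raw)["features"][0]
    lon, lat = feature["geometry"]["coordinates"]
    props = feature["properties"]
    return {
        "lat": lat,
        "lon": lon,
        "placename": props.get("reverseGeocode"),
        "timestamp": props.get("timeStamp"),
    }
```

<p>The templatetag just drops this dict into the template context, and the static maps URL is built from <code>lat</code>/<code>lon</code>.</p>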
<p>Also featured: Django 1.4's <a href="https://docs.djangoproject.com/en/dev/ref/contrib/humanize/">humanize templatetags</a>, in this case naturalday to give the day I was last seen (Today, Yesterday, or otherwise a date). There are a couple of other useful tags in this library, displaying things like human-friendly numbers (one, two, 1 million).</p>Checking for Spatial Clustering of Points in R2013-06-05T01:15:07+00:00admin/blog/author/admin/http://doorinthewall.co.za/blog/checking-for-spatial-clustering-of-points-in-r/<p>The R script below tests whether a set of points on a plane is significantly clustered (as opposed to randomly distributed or, conversely, equally spaced), using the <a href="http://rss.acs.unt.edu/Rdoc/library/spatstat/html/clarkevans.html">Clark-Evans R measure</a>. From the R spatstat package manual: "[Clark-Evans R] is the ratio of the observed mean nearest neighbour distance in the pattern to that expected for a Poisson point process of the same intensity. A value <i>R>1</i> suggests ordering, while <i>R<1</i> suggests clustering."</p>
<p>If you have points on a sphere you'll have to reproject - as far as I can tell there's no way to use spherical distance formulas. You'll also want to use a different measure to check for spatial clustering with respect to an attribute of the data points, for example <a href="http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000009000000">Getis-Ord G</a>.</p>
<p>The spatstat package requires a window (essentially a polygon bounding box), and in this example a shapefile of the contiguous US is used. Because the edges of the data are a coastline (i.e. physically significant rather than an arbitrary data limit), I'm not using the edge-corrected R in the results. The test gives a p-value, which tells you how significantly clustered/ordered your data is.</p>
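<p>For intuition about what the R script below computes: Clark-Evans R is just the observed mean nearest-neighbour distance divided by its expectation, 1/(2*sqrt(density)), under complete spatial randomness. A brute-force Python sketch, with no window handling or edge correction (which is exactly why spatstat is worth using):</p>

```python
import math
import random

def clark_evans(points, area):
    # observed mean nearest-neighbour distance (brute force, O(n^2))
    nn = []
    for i, (xi, yi) in enumerate(points):
        nn.append(min(math.hypot(xi - xj, yi - yj)
                      for j, (xj, yj) in enumerate(points) if j != i))
    observed = sum(nn) / len(points)
    # expected mean NN distance for a Poisson process of the same intensity
    expected = 0.5 / math.sqrt(len(points) / area)
    return observed / expected

# uniform random points in the unit square: R should come out close to 1
random.seed(1)
pts = [(random.random(), random.random()) for _ in range(500)]
print(clark_evans(pts, area=1.0))
```

<p>Tightly clustered points drive the observed nearest-neighbour distance down, so R drops well below 1.</p>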
<pre>coords97<-read.csv("1997CoCoords.csv")<br><br>install.packages("deldir")<br>install.packages("maptools")<br>install.packages("spatstat_1.31-1.tar.gz", repos = NULL, type="source")<br><br>library(spatstat)<br>library(maptools)<br><br>x<-readShapeSpatial("nos80k.shp")<br>y<-as(x, "SpatialPolygons")<br>spatstat.options(checkpolygons=FALSE)<br>usawin<-as.owin(y)<br>spatstat.options(checkpolygons=TRUE)<br><br>points97<-ppp(coords97[,2],coords97[,3], window=usawin)<br><br>ce97<-clarkevans(points97)<br>ce97test<-clarkevans.test(points97, alternative="clustered")<br>
</pre>SaferMobile - Mobile Security Resources2013-06-04T22:40:23+00:00admin/blog/author/admin/http://doorinthewall.co.za/blog/safermobile-mobile-security-resources/<p>Safer Mobile was a project of MobileActive.org, publishing tactical and software resources to help activists use mobile technology more securely. As part of this project, I wrote training guides on Facebook, Twitter and SMS security, encryption and circumvention tools for mobile browsing, and mobile app security. I was also a primary author on the Safer Mobile mobile security training guide. My other writing for MobileActive included guides on how to set up an SMS system, mobile tools for data collection, and Interactive Voice Response (IVR) system selection.</p>
<p>Unfortunately, in late 2012 both the Safer Mobile and MobileActive.org sites were taken offline, along with all the material posted there. Some of the Safer Mobile team have collected drafts of the articles and training guides in a Git repository <a href="https://github.com/opensafermobile/materials">here</a>. More information, including the license for the materials, is <a href="https://www.theengineroom.org/safermobile-materials/">here</a>. I'll also be posting a few unpublished articles on this site when I get a chance to update them - as mentioned in the link, mobile security threats change fast, and this material was written in 2011-2012.</p>Critical Media Project2013-06-03T23:05:29+00:00admin/blog/author/admin/http://doorinthewall.co.za/blog/critical-media-project/<p>The Critical Media Project collects curated multimedia content for use in media literacy education in the areas of race and ethnicity, class, gender, and sexuality. The site hasn't launched yet, but you can see some examples of media content and discussion questions <a href="http://www.criticalmediaproject.org/cml/media/tag/lgbtq/">here</a>.</p>
<p>It's built using <a href="http://mezzanine.jupo.org/">Mezzanine</a>, the same great Django CMS that runs this site. Media artifacts are <a href="http://mezzanine.jupo.org/docs/content-architecture.html">custom content types</a>. <a href="http://wistia.com/">Wistia</a> provides media hosting (for images, sound, and documents in addition to video, which lots of other services don't), and it's integrated via <a href="http://oembed.com/">oembed</a>. The site is hosted on <a href="https://gondor.io/">Gondor</a>, which made it really easy to deploy new things for review by a distributed team. Good tools make all the difference!</p>USC Bike Commuters2013-06-02T22:52:10+00:00admin/blog/author/admin/http://doorinthewall.co.za/blog/usc-bike-commuters-community-site/<p><a href="http://www.uscbikecommuters.com/">USC Bike Commuters</a> was one of my first Django projects. It's a community networking site for bike commuters at the University of Southern California. The idea is that people can put the start of their commute on a shared map and connect with others who might bike the same route. The basic social network stuff (user accounts, messaging) uses <a href="http://pinaxproject.com/">Pinax</a>. It also has an AJAX-y social media 'wall' for announcements, questions and requests.</p>Network Analysis of Large Online Datasets2013-05-26T19:57:30+00:00admin/blog/author/admin/http://doorinthewall.co.za/blog/network-analysis-of-large-online-datasets/<p>In late 2012, I spent an infinitely frustrating week trying to perform network analysis (specifically, build and run ergm models) on part of the bitcoin blockchain. I learned some things. If you are here because you want to run ergm models on large datasets, you're advised to back away slowly, then run like hell. But here are some notes.</p>
<h4>Resources</h4>
<p>Coursera classes are great for this stuff! This semester they ran Social Network Analysis, and Introduction to Data Science is coming up in April 2013.</p>
<p>There’s a new book out on data mining, based on a graduate-level CS class on the topic at Stanford: Rajaraman, A., Leskovec, J., & Ullman, J. D. (2012). Mining of Massive Datasets (p. 398). Retrieved from <a href="http://infolab.stanford.edu/~ullman/mmds/book.pdf">http://infolab.stanford.edu/~ullman/mmds/book.pdf</a></p>
<h4>Skills</h4>
<p>Most things can be learned if you’re determined and have a lot of time, but there’s also a point where it’s worth finding a research partner who can short-circuit that learning curve. If you don’t know anything about the following, you may want to find someone who does.</p>
<ul>
<li>A scripting language - commonly Python, also Perl or PHP</li>
<li>A database management system (relational or NoSQL; extremely large datasets will need the latter) and how to build a database and query it</li>
<li>Basic unix command line work, particularly if you want to try running things on a cloud computing service such as Amazon EC2</li>
</ul>
<h3>Getting the data</h3>
<p>Find out about APIs for the service you’re interested in. Be aware that many APIs exist to allow integration with third-party clients and services rather than to provide data access, and this may limit the methods available.</p>
<ul>
<li><a href="http://www.programmableweb.com/apis/directory/1?apicat=social">ProgrammableWeb API list</a></li>
</ul>
<p>Think about this early, especially any limitations the APIs might have about what you can get, and how much of it you can get at once (rate limits). You don’t want to be racing the clock to download your data.</p>
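<p>The shape of a rate-limit-friendly download loop is worth sketching. Here <code>fetch_page</code> is a stand-in for whatever your API client call actually looks like - assumed to return a page of items plus a cursor for the next page (or None when done):</p>

```python
import time

def fetch_all(fetch_page, delay=1.0, max_pages=None):
    """Collect every page of results, sleeping between requests so you
    stay under the provider's rate limit. Assumes the (hypothetical)
    contract fetch_page(cursor) -> (items, next_cursor)."""
    results = []
    cursor = None
    pages = 0
    while True:
        items, cursor = fetch_page(cursor)
        results.extend(items)
        pages += 1
        if cursor is None or (max_pages is not None and pages >= max_pages):
            return results
        time.sleep(delay)  # be polite; many APIs also send rate-limit headers
```

<p>The fixed sleep is the crudest possible strategy - if the API reports remaining quota in its response headers, use that instead.</p>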
<p><a href="https://scraperwiki.com/">ScraperWiki</a> is a nice way to scrape data from the web if you don’t want to have to write too much code, and/or there isn’t an API available to do what you need. They also offer a service that will do your scraping for you for a fee.</p>
<h4>R memory limitations</h4>
<p>R stores all the data you’re currently working with (i.e. all the objects in your workspace) in RAM. If you have a lot of data - and network data, particularly in matrix representations, can get very large very fast - you’ll need to be aware of roughly how big your workspace objects are.</p>
<p>The dreaded “cannot allocate vector of size” error indicates you’ve run out of addressable memory. How much memory R has available depends on your operating system and architecture. On 32-bit Unix OSs, the limit is 3GB (or 4GB), while on Windows systems it’s 2GB. In practice, R may complain about objects of size half or greater than the total memory available to it. On 64-bit Unix OSs, you are in theory limited only by the size of the system memory.</p>
<h4>How much memory does your object need?</h4>
<p>How much memory you need for a network object depends on the size of the network (how many nodes, how many edges), data type of the matrix entries (boolean, integer, double), and whether the network can be efficiently stored using a sparse matrix representation. The igraph and network packages by default store networks this way. For the network package, memory requirements are <a href="http://www.jstatsoft.org/v24/i02/paper">of the order of the total number of nodes plus the total number of edges</a>; igraph’s data storage may be even more efficient.</p>
<p>Most network analysis matrices are sparse, but snowball samples are a special case. The network package requires that unmeasured edges be recorded as NA (rather than 0, which would mean ‘measured but not present’). In a snowball sample, the ‘outer edge’ of your network is nodes that other nodes in the sample have a relationship with. You haven’t measured any edges from these nodes, only edges to them. The code to do this is a little non-intuitive:</p>
<pre># n_nodes is the number of nodes in the network;<br># outer_edge_of_snowball(i) stands in for your own test of whether<br># node i is on the sample's outer edge<br>for (i in 1:n_nodes) {<br>  if (outer_edge_of_snowball(i)) {<br>    net3[,i,add.edges=TRUE]<-NA<br>  }<br>}
</pre>
<p>Recording all the required NA edges is pretty devastating for a large network with a lot of unmeasured edges. Estimate your memory requirements before you start to avoid a long-running script that looks like it’s coping until suddenly, 2 days later, it isn’t.</p>
<p>Another point to note: any computation on the adjacency matrix representation of your data is not using a sparse matrix data structure any more, and your memory requirements increase accordingly. An integer matrix of size n requires n*n*4 bytes.</p>
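<p>That arithmetic is worth doing on the back of an envelope before you run anything. In Python:</p>

```python
def dense_matrix_bytes(n, bytes_per_entry=4):
    """Memory for a dense n x n adjacency matrix of 4-byte integer entries."""
    return n * n * bytes_per_entry

# a 30,000-node network forced into a dense integer adjacency matrix
# already blows past a 32-bit R session's limit:
print(dense_matrix_bytes(30000) / 1024 ** 3)  # just over 3.3 GB
```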
<h4>Handling memory limitations</h4>
<p>The most commonly recommended way to handle memory limitations in R is not to have all your data in memory at once. Use a database to store the full dataset, and query it to get just the ‘slice’ (a subset of cases or a subset of variables) you need to analyze. There are various connector libraries for R that allow you to query data stored in SQL or NoSQL databases, or just output the result of the query to a .csv file for import into R.</p>
<p>If this isn’t feasible, cloud computing services such as Amazon EC2 let you access more memory relatively cheaply. EC2 instance specifications are <a href="http://aws.amazon.com/ec2/#instance">here</a>, with pricing <a href="http://aws.amazon.com/ec2/#pricing">here</a>. Choose a 64-bit Unix image to start, log in over ssh and install R, then copy your script to the server to run it. At $1.80 per hour for the double-extra-large 30GB instance, it’s worth a shot! Actually, in this range there are likely other issues limiting the amount of memory a single object can occupy, but some of the smaller instances allow you to run your analysis with double or triple the memory typically available on a 32-bit PC. More information on running R on EC2 <a href="http://decisionstats.com/2010/09/25/running-r-on-amazon-ec2/">here</a>, <a href="http://www.slideshare.net/jeffreybreen/big-data-stepbystep-infrastruture-23#btnNext">here</a> and <a href="http://rgrossman.wordpress.com/2009/05/17/running-r-on-amazons-ec2/">here</a>.</p>
<p>Worth knowing: the command to run a script that keeps running when you log out of the terminal is:<br><code>nohup R --no-save < infile.R > outfile.txt &<br></code>(R run non-interactively needs one of --save/--no-save.) This is important even if you don’t plan to log out, as R’s propensity to 100% CPU usage if left unchecked can do weird things to your ssh session.</p>PostGIS install on Ubuntu 12.042013-05-26T18:16:24+00:00admin/blog/author/admin/http://doorinthewall.co.za/blog/postgis-install-on-ubuntu-1204/<p>This post shows you (and me when I forget in a couple of months!) how to set up a basic toolchain for spatial queries on Ubuntu 12.04. After installing PostgreSQL and PostGIS, we'll load data from a shapefile, try some queries, and display the spatial database layer in QGIS.</p>
<h4>Step 1: Install PostGIS</h4>
<p>Install the packages you’ll need</p>
<p><code>sudo apt-get install postgresql-9.1-postgis postgresql-contrib-9.1 pgadmin3</code></p>
<p>Become postgres user</p>
<p><code>sudo su - postgres</code></p>
<p>Create a separate database user for your spatial databases</p>
<p><code>createuser -SdR gisuser</code></p>
<p>Create your database</p>
<p><code>createdb -E UTF8 -O gisuser dbname</code></p>
<p>Load the PL/pgSQL procedural language (the postgis functions are written using this)</p>
<p><code>createlang plpgsql dbname</code></p>
<p>Load the postgis spatial functions</p>
<p><code>psql -d dbname -f /usr/share/postgresql/9.1/contrib/postgis-1.5/postgis.sql</code></p>
<p>Load the spatial reference systems table</p>
<p><code>psql -d dbname -f /usr/share/postgresql/9.1/contrib/postgis-1.5/spatial_ref_sys.sql</code></p>
<p>Change table ownership</p>
<p><code>psql dbname -c "ALTER TABLE geometry_columns OWNER TO gisuser"</code></p>
<p><code>psql dbname -c "ALTER TABLE spatial_ref_sys OWNER TO gisuser"</code></p>
<p>Change the postgres user’s database password so you can use it to log in with pgadmin3, and the gisuser’s password so you know what it is</p>
<p><code>psql dbname</code></p>
<p><code>ALTER USER postgres WITH PASSWORD 'newpassword';</code></p>
<p><code>ALTER USER gisuser WITH PASSWORD 'gispassword';</code></p>
<p>Quit psql</p>
<p><code>\q</code></p>
<p>Log out of postgres user session</p>
<p><code>exit</code></p>
<h4>Step 2: Look at your new database in pgadmin3</h4>
<p>Load up pgadmin3. Click the plug icon to create a new connection.</p>
<ul>
<li>name: localhost</li>
<li>server: localhost</li>
<li>user: postgres</li>
<li>password: the database password you set for the postgres user</li>
</ul>
<p>If it connects successfully, expand server groups > servers > localhost > databases, and click the name of your database (if you can’t expand something, click once to connect, then try again). You’ll see the objects that make up your database. To see tables, expand schemas > public > tables. If you click on a table, you’ll see the SQL create statement that made it in the pane in the lower right corner of the window.</p>
<p>You should see two tables: spatial_ref_sys and geometry_columns</p>
<h4>Step 3: Try a query in pgadmin3</h4>
<p>Click the SQL magnifying glass icon in the pgadmin3 toolbar</p>
<p>Try the following:</p>
<p><code>SELECT srtext FROM spatial_ref_sys WHERE srid = 4269;</code></p>
<p>Run the query using the green play button in this window’s toolbar. You should see information about this projection.</p>
<h4>Step 4: Load the zipcodes shapefile into the database</h4>
<p>Download a file of california zipcodes from <a href="http://www.census.gov/geo/cob/bdy/zt/z500shp/zt06_d00_shp.zip">http://www.census.gov/geo/cob/bdy/zt/z500shp/zt06_d00_shp.zip</a></p>
<p><code>sudo cp Downloads/zt06_d00_shp.zip /var/lib/postgresql/</code></p>
<p><code>sudo chmod 777 /var/lib/postgresql/zt06_d00_shp.zip</code></p>
<p><code>sudo su - postgres</code></p>
<p><code>unzip zt06_d00_shp.zip</code></p>
<p><code>shp2pgsql -c -D -s 4269 -I zt06_d00.shp public.zipcodes > zips.sql</code></p>
<p><code>psql -d dbname -f zips.sql</code></p>
<p><code>psql dbname -c "ALTER TABLE zipcodes OWNER TO gisuser"</code></p>
<p><code>exit</code></p>
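<p>Once the table is loaded you can also query it from Python rather than psql. A sketch using a DB-API connection (e.g. from psycopg2) - note that the column names here (<code>zcta</code> for the zipcode, <code>the_geom</code> for the geometry) are what shp2pgsql generated for me; check yours in pgadmin3:</p>

```python
def largest_zipcodes(conn, limit=5):
    """Return (zipcode, area) pairs for the largest zipcode polygons.
    conn is any DB-API connection; column names are assumptions - see above."""
    cur = conn.cursor()
    cur.execute(
        "SELECT zcta, ST_Area(the_geom) AS area "
        "FROM zipcodes ORDER BY area DESC LIMIT %s",
        (limit,),
    )
    rows = cur.fetchall()
    cur.close()
    return rows

# usage:
# import psycopg2
# conn = psycopg2.connect(dbname="dbname", user="gisuser",
#                         password="gispassword", host="localhost")
# print(largest_zipcodes(conn))
```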
<h4>Step 5: Look at the zipcodes spatial table in QGIS</h4>
<p><code>sudo apt-add-repository ppa:ubuntugis/ubuntugis-unstable</code></p>
<p><code>sudo apt-get update</code></p>
<p><code>sudo apt-get install qgis</code></p>
<p>Start qgis from terminal or gui menu</p>
<p>In top menu, click layer > add postgis layer</p>
<ul>
<li>name: dbname</li>
<li>service:</li>
<li>host: localhost</li>
<li>database: dbname</li>
<li>user: gisuser</li>
<li>password: gispassword</li>
</ul>
<p>Click test connection - if it works, check the boxes to save the username and password, then click OK.</p>
<p>Connect to the connection you just set up, select the zipcodes table, and click add.</p>
<p>Zipcode polygons appear!</p>