Title: Collecting "Big Data" to Understand the Impact of Global Internet Censorship and Surveillance

Abstract:
========

Censorship and surveillance on the Internet are a global phenomenon with far-reaching and transformative effects on society, yet research on this phenomenon is still nascent and limited in scope (e.g., to a single country or a short timeframe). Important questions go unanswered. For example, how commonly are support websites made inaccessible to at-risk populations (such as domestic abuse victims) because they are miscategorized as pornography? What role do software and Internet media companies play, either intentionally or unwittingly, in state surveillance in various parts of the world? Who decides which keywords trigger censorship or surveillance in different market segments for different countries? How are the national-scale firewalls that limit Internet traffic evolving? Longitudinal datasets that are global in scope are needed to truly understand the nature and impact of Internet censorship and surveillance, but how do you collect large datasets about a phenomenon that is shrouded in secrecy?

In this talk I'll discuss two research thrusts that my group is pioneering, each of which has the potential to scale to truly "big data". One research thrust is TCP/IP side channels, which make it possible to measure network conditions between any two points in the world without having any infrastructure at either point or in between. In other words, using a single Linux machine here in North America, we can, for example, determine whether an IP address in Zimbabwe can communicate with another IP address in Saudi Arabia or whether a firewall restricts their communications. It sounds like magic, but I'll explain how this is made possible through spoofed return IP addresses and careful monitoring of the state of remote machines' network stacks. Our goal is to measure Internet censorship everywhere, all the time.

The second research thrust is reverse engineering. We are collaborating with the Citizen Lab at the University of Toronto to reverse engineer closed-source software and reveal its secrets. Some companies implement censorship and surveillance within their software, while others make claims about privacy and cryptography that aren't true and thereby put the communications of journalists, activists, ethnic minorities, and many others at risk. The large amount of software that is in use by at-risk populations makes this an essentially "big data" problem.
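
To give a feel for the side-channel thrust described above, here is a toy simulation of how a spoofed return address plus observation of a remote machine's state could reveal whether two other hosts can communicate. The abstract does not say which piece of network stack state is monitored. Purely as an assumption for illustration, this sketch uses a host-wide IP identification counter that increases by one for every packet the host sends. All names (`IdleHost`, `probe`, `spoofed_syn`, `infer_blocked`) are hypothetical, the host is assumed to be otherwise idle, and no real packets are sent.

```python
from dataclasses import dataclass


@dataclass
class IdleHost:
    """Toy remote host with one global IP ID counter and no other traffic.

    Assumption for illustration: the counter increments by one per packet sent.
    """
    ipid: int = 1000

    def send_packet(self) -> int:
        # The IP identification field is 16 bits wide, so it wraps at 65536.
        self.ipid = (self.ipid + 1) % 65536
        return self.ipid


def probe(host: IdleHost) -> int:
    """The measurement machine elicits one reply and reads its IP ID."""
    return host.send_packet()


def spoofed_syn(host: IdleHost, server_to_host_blocked: bool) -> None:
    """Model a SYN sent to a server with the host's address as spoofed source.

    The server answers the host with a SYN-ACK. If that SYN-ACK arrives, the
    host replies with an RST (it never opened the connection), which advances
    its counter. The measurement machine never sees the RST itself.
    """
    if not server_to_host_blocked:
        host.send_packet()


def infer_blocked(host: IdleHost, server_to_host_blocked: bool) -> bool:
    """Infer blocking from the counter alone, as the measurement machine would."""
    before = probe(host)
    spoofed_syn(host, server_to_host_blocked)
    after = probe(host)
    delta = (after - before) % 65536
    # delta == 1: only the second probe reply was sent, so nothing got through.
    # delta == 2: one extra packet (the RST) was sent in between.
    return delta == 1
```

In this model, calling `infer_blocked(IdleHost(), server_to_host_blocked=True)` reports blocking using nothing but the two counter readings. A real measurement would have to cope with background traffic, packet loss, and hosts that do not use a single global counter, none of which this sketch attempts.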