Mapping YouTube Views

YouTube has been an entertainment phenomenon ever since it arrived on the internet in 2005. Its reach is staggering, bringing videos to every corner of the Earth. In every country of the world the word YouTube is synonymous with online entertainment. I’ve always been fascinated by the maps YouTube provided in the “statistics” section of the videos. Every country in the world would be represented on the most popular videos. It’s a shame YouTube has removed these statistics from public view. Now it’s only possible to see these stats if the uploader makes them available.

YouTube analytics

YouTube has a great analytics platform for content creators. It has an interactive map built into the creator studio which is great for geographic analysis. There are ways to export this data using the API tools YouTube provides. I thought it would be fun to take this data and create a couple of maps of my own. Instead of using the API I acquired the data the old-fashioned way: copying and pasting.

I decided to make a map of every country except the United States. Since 95% of my views come from the United States, some methods of separating the data would make other countries almost indistinguishable on a map.

After copying and pasting the lifetime statistics from the interactive map portion of the YouTube analytics page, I added them to an Excel spreadsheet and created a .csv document to add to ArcMap. There was limited parsing to be done since all the data was already organized. I removed the fields I wasn’t going to be using, like watch time, average view duration, and average percentage viewed. In the future it might be interesting to map these variables but today I’m just going to focus on raw view numbers.
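I did this clean-up by hand in the spreadsheet, but it could also be scripted. Here is a minimal Python sketch of the idea. The column names and the sample figures are assumptions for illustration, not my real channel data or the exact headers YouTube exports.

```python
import csv
import io

# Hypothetical sample of the pasted analytics table; column names and
# figures are illustrative assumptions, not real channel data.
RAW = """Country,Views,Watch time (minutes),Average view duration,Average percentage viewed
Canada,120,340,2:50,41%
Germany,85,190,2:14,35%
"""

KEEP = ["Country", "Views"]  # the only fields needed for a raw-views map


def keep_columns(text, columns):
    """Return CSV text containing only the requested columns."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in columns})
    return out.getvalue()


cleaned = keep_columns(RAW, KEEP)
print(cleaned)
```

Keeping the list of wanted columns in one place makes it easy to bring watch time or view duration back later if I decide to map those too.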

I’m using the template that I used for my WordPress map. It uses a basemap and a borders shapefile from thematicmapping. This easily allows me to join the csv to the shapefile table and we’re quickly off to the cartographic races.

Compared to the WordPress site, my YouTube channel has a much more impressive geographic reach. Out of the 196 countries on Earth, 134 of them have clicked on a video I host on my channel. This is great because it means I’m over halfway to completing my collection of all countries.

The map includes every country except the United States, which has over 11,000 views. I decided to use 10 natural breaks in the colors to add more variation to the map. Experts say that the average human eye can’t differentiate more than 7 colors on a map. In this case it is purely a design choice.

YoutubeViews_sansUSA

It looks like I have to carry some business cards with me next time I go to Africa. It’s nice to see such a global reach. It feels good to know that, even for a brief second, my videos were part of someone’s life in the greater global community.

Mapping WordPress Views

It’s been a year since I started writing this blog. Time, as always, seems to fly by. Blogging here has allowed me to develop my writing, communication, and research skills. I thought I’d do something WordPress related to celebrate a year of success and hopefully many more to come. I thought of a quick and easy project to map the geographic locations of visitors to this blog over the last year. It’s always interesting to see what countries people are visiting from and I’m always surprised at the variety.

Data acquisition is simple for this project. WordPress makes statistics available, so it’s not difficult to acquire or parse the data since what is provided is pretty solid. The one thing that needs to be done is combining the 2016 and 2017 data into one set since WordPress automatically categorizes visitation statistics by year. Since this blog has only been active for 2016 and 2017, there are only two datasets to combine. This is easily done in a spreadsheet with the WordPress statistics at hand.
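I combined the two years in a spreadsheet, but the same step is only a few lines of Python. This is just a sketch, and the country counts below are made-up placeholders rather than my actual WordPress numbers.

```python
from collections import Counter

# Hypothetical per-country view counts for each year (illustrative only).
views_2016 = {"United States": 400, "Canada": 25, "Germany": 10}
views_2017 = {"United States": 520, "Canada": 30, "New Caledonia": 1}

# Adding two Counters sums the counts for countries present in either year.
combined = Counter(views_2016) + Counter(views_2017)

for country, views in combined.most_common():
    print(country, views)
```

A country that only appears in one year still ends up in the combined set, which is what the all-years map needs.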

The data suggests growth, with 2017 already overtaking the entirety of 2016 in terms of views. It’s also interesting that 2017 is more geographically diverse, consisting of 49 unique countries compared to 31 in 2016. I decided it would be appropriate to create 3 maps: one for 2016, one for 2017, and one combining the two. This would allow one to interpret the differences between the years and see the geographic implications as a whole.

I began by exporting the data into a CSV file to be read by ArcMap. I decided on the blank world map boundaries from thematicmapping.org for a basemap. The previously prepared CSV was then attached to the basemap via the “name” entry, which reconciles both data tables with the name of each country. Once the data is on the map it’s over to the quantities symbology to adjust the color scheme and create a choropleth map. I chose to break the data 7 ways and to remove the borders from the countries to give it a more natural, pastel look.
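Because the join relies on country names matching exactly, it can be worth checking for mismatches before joining. The sketch below is plain Python rather than anything ArcMap does internally, and the names and counts in it are hypothetical examples.

```python
def join_by_name(shape_names, csv_rows):
    """Attach view counts to shapefile names; collect CSV names with no match."""
    # Compare names case-insensitively and ignore stray whitespace.
    lookup = {name.strip().lower(): name for name in shape_names}
    joined, unmatched = {}, []
    for name, views in csv_rows:
        key = name.strip().lower()
        if key in lookup:
            joined[lookup[key]] = views
        else:
            unmatched.append(name)
    return joined, unmatched


# Hypothetical names: the shapefile and the statistics page may spell
# some countries differently.
shape_names = ["United States", "Russia", "Korea, Republic of"]
csv_rows = [("United States", 900), ("russia ", 12), ("South Korea", 7)]

joined, unmatched = join_by_name(shape_names, csv_rows)
print("joined:", joined)
print("needs a manual fix:", unmatched)
```

Anything that lands in the unmatched list would otherwise show up as an empty country on the map, so it is a quick way to catch naming differences.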

In layout view the design touches are added. A title was placed at the top and the map was signed. The legend was added and I used one of the tricks I’ve found useful to format it. First I add the legend with all the default settings and get the positioning correct. After it’s in position I double check that the data components are correct. Then “convert to graphics” is selected to turn the legend into an editable collection of graphic elements. The only downside to this is that it no longer reflects changes in the data, so making sure the data is correct before converting is critical. After it’s been converted, selecting “ungroup” will separate each of the graphical elements, allowing the designer to manipulate each individually. I personally find this easier and more intuitive to work with. After editing, the elements can be regrouped and other elements like frames and drop shadows can be added.

Wordpress2016

Making the 2017 map followed the same methodology.

Wordpress2017

Combining the two datasets was the only methodological variation when making the final map.

WordpressAll

At a glance, the trends seem typical. North America is represented in the data, as is Europe. There is an unexpected representation in Asia which might be due to the several articles that have been written about China. It’s also neat seeing visitors from South America. The rarest country is New Caledonia, a French territory in the Pacific about 1000 miles off the eastern coast of Australia.

In the future it would be interesting to create a map that normalizes the number of visitors according to the population of the countries. This would create a map that shows which countries visit at a higher or lower rate per capita. This would illustrate which countries are more drawn to the content on the site.
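The normalization itself is simple arithmetic: visitors divided by population, scaled to something readable like visitors per million residents. A rough Python sketch, using two made-up countries so no real figures are implied:

```python
def visits_per_million(visits, population):
    """Visitors per one million residents for each country with a known population."""
    return {
        country: count * 1_000_000 / population[country]
        for country, count in visits.items()
        if country in population
    }


# Hypothetical figures for two fictional countries.
visits = {"Country A": 50, "Country B": 5}
population = {"Country A": 100_000_000, "Country B": 500_000}

rates = visits_per_million(visits, population)
for country, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(country, rate)
```

In this toy example the small country has a tenth of the visitors but twenty times the per capita rate, which is exactly the kind of pattern a raw-count map hides.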

Here’s to hoping for more geographical variation in the future. Maybe one day all countries will have visited Thoughtworks.

Mapping Malicious Access Attempts

Data provides an illuminating light in the dark in the world of network security. When considering computer forensics assessments, the more data available, the better. The difference between being clueless and having a handle on a situation may depend on one critical datapoint that an administrator may or may not have. When data metrics that accompany malicious activity are missing, performing proper forensics of the situation becomes exponentially more difficult.

Operating a media server in the cloud has taught me a lot about the use and operation of internet facing devices. The server is provided by a third party who leases servers in a data center. This machine runs Lubuntu, a distribution of Linux. While I’m not in direct control of the network this server is operating on, I do have a lot of leeway in what data can be collected since it is “internet facing”, meaning it connects directly to the WAN, allowing it to be interacted with as if it were a standalone server.

If you’ve ever managed an internet facing service you’ll be immediately familiar with the number of attacks targeted at your machine, seemingly out of the blue. These aren’t always manual attempts to gain access or disrupt services. They are normally automated and persistent: someone only has to designate a target, and botnets and other malicious actors, tasked with the heavy lifting, begin an attack that is capable of operating on its own without human interaction.

While learning to operate the server, I found myself face to face with a number of malicious attacks directed at my IP address, seeking to brute force the root password in order to establish an SSH connection on the server. This would essentially give an attacker complete control of the server, and a strong password is the only thing standing between the vicious world of the internet and the controlled environment of the server. The log of these attempts provided a number of IP addresses and, like any good geographer, I was eager to put the data on a map to spatially analyze what part of the world these attacks were coming from, and to glean some information on who these actors were and why they were targeting my media server, an entity with little to no tangible value beyond the equipment itself.

Screenshot_20170527-000900

This log of unauthorized access attempts can be found in many mainstream Linux distributions in the /var/log/auth.log file. By using the following bash command in the terminal it is possible to count how many malicious attempts were made by each unique IP and rank them by count.

grep "Failed password for" /var/log/auth.log | grep -Po "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" | sort | uniq -c | sort -rn

Running this command allows a system administrator to quickly see which IP addresses failed to authenticate and how many times they failed to do so. This is one of the steps that turn raw data into actionable knowledge. By turning this raw data into interpretable data we actively improve its interpretability and, as a result, its usability.
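The same count can be sketched in Python, which is handy if the results are headed for a spreadsheet anyway. This is not what I ran on the server, just an equivalent of the idea. The sample log lines are invented for illustration and use reserved documentation IP ranges, and unlike grep -o the sketch only takes the first IP-looking match on each line.

```python
import re
from collections import Counter

# Same loose pattern as the grep command: four dot-separated runs of digits.
IP_PATTERN = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")


def count_failed_logins(lines):
    """Count failed password attempts per source IP address."""
    counts = Counter()
    for line in lines:
        if "Failed password for" in line:
            match = IP_PATTERN.search(line)  # first match on the line only
            if match:
                counts[match.group()] += 1
    return counts


# Invented sample lines in the usual sshd style (documentation IP ranges).
sample = [
    "May 15 03:12:01 server sshd[1201]: Failed password for root from 203.0.113.7 port 52110 ssh2",
    "May 15 03:12:04 server sshd[1201]: Failed password for root from 203.0.113.7 port 52112 ssh2",
    "May 15 03:13:10 server sshd[1310]: Failed password for invalid user admin from 198.51.100.23 port 40022 ssh2",
    "May 15 03:14:00 server sshd[1400]: Accepted password for media from 192.0.2.10 port 50000 ssh2",
]

counts = count_failed_logins(sample)
for ip, attempts in counts.most_common():
    print(attempts, ip)
```

To run it against a real log, an open file object for /var/log/auth.log can be passed in place of the sample list, since the function only needs something it can loop over line by line.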

This list is easily exported to an Excel spreadsheet where the IPs can be georeferenced using other sources like abuseipdb.com. Using this service I was able to link each IP address and its number of access attempts to the geographic location associated with it at the municipal, state, and national level.

After assigning each IP address a count and a geographic location I was ready to put the data on a map. Looking over the Excel spreadsheet showed some obvious trends out of the gate. China seems to account for the majority of the access attempts. I decided to create 3 maps. The first would be based on the city the attack originated from, with a surrounding graduated symbology that expressed the number of attacks that originated from the data point. These would allow me to see at a glance where the majority of the attacks globally and spatially originated.

The first map was going to be tricky. Since the georeferencing built into ArcMap requires a subscription to the ArcGIS Online service, I decided to parse my own data. I grouped all these entries and consolidated them by city. Then I went through and manually entered the coordinates for each one. This is something I’d like to find an easier solution for in the future. When working with coordinates, it’s also important to use matching coordinate systems for all features in ArcMap to avoid geographic inaccuracies.

map2b

Full resolution – http://i.imgur.com/sY0c7IJ.jpg

Something I’d like to get better at is reconciling the graduated symbology between the editing frame and the data frame. Sometimes size inaccuracies can throw off the visualization of the data. This is important to consider when working with graduated symbology, like in this case, where the larger symbols are limited to 100 pts.

The second map included just countries of origination, disregarding the cities metric. This choropleth map was quick to create, requiring just a few tweaks in the spreadsheet. This would provide a quick and concise visualization of the geographic national origins of these attacks in a visually interpretable format. This would be appropriate where just including cities in the metric would be too noisy for the reader.

The following is a graphical representation of the unauthorized access attempts on a media server hosted in the cloud, with the IPs resolved to the country of origin. Of the roughly 53,000 access attempts between May 15 and May 17, over 50,000 originated from China.

To represent this choropleth map I saved the data into a .csv file and imported it into ArcMap. Then came the georeferencing. This was easily done with a join operation with a basemap that lists all the countries. The blank map shapefile was added twice: once for the join and once for the background. During the join operation I removed all the countries I didn’t have a count for. Then I sent this layer to the top so all the colorless empty countries would appear behind the countries with data. This is one thing I continue to love and be fascinated with about ArcMap: the number of ways to accomplish a task. You could use a different methodology for every task and find a new approach each time.

map3

Full resolution – http://i.imgur.com/XyqOexM.png

I decided the last map should be the provinces in China to better represent where attacks were coming from in this area of the world. The data was already assembled so I sorted the Excel spreadsheet by the country column and created a new sheet with just the Chinese entries. I was able to refer to the GIS database at Harvard which I wrote about in an earlier article concerning the ChinaX MOOC they offered. This was reassuring considering my familiarity with the source. The Excel spreadsheet was then consolidated and a quick join operation to the newly downloaded shapefile is all it took to display the data. A choropleth map would be appropriate for this presentation. I had to double check all the province names to make sure no major provincial changes had been missed by the dataset, considering the shapefile was from 1997.

map4

Full resolution – http://i.imgur.com/ZhJpHLM.png

While the data might suggest that the threats are originating from China, the entities with a low number of connections might be the most dangerous. If someone attempts to connect 1 time, they might have a password that they retrieved by means of a Trojan horse or a password leak. These are the entities that may be worth investigating. All these entries were listed in the abuseipdb database so they all had malicious associations. While these threats aren’t persistent or automated, they might suggest an advanced threat or threat actor.

Some of the data retrieval might be geographically inaccurate. While georeferencing IP addresses has come a long way, it’s still not an entirely empirical solution. Some extra effort might be required to make sure the data is as accurate as possible.

How does this data help? I can turn around and take the most incessant threats and blacklist them on the firewall so they’ll be unable to even attempt to log in. Using this methodology I can begin to create a blacklist of malicious IPs that I can continue building upon in the future. This allows me to geographically create a network of IPs that might be associated with a malicious entity.
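Picking out the most incessant offenders from the counts is easy to sketch in Python. The threshold of 100 attempts and the sample counts are assumptions for illustration, and how the resulting list gets loaded into a firewall depends on the firewall in use.

```python
def build_blacklist(counts, min_attempts):
    """Return IPs with at least min_attempts failures, most persistent first."""
    offenders = [(ip, n) for ip, n in counts.items() if n >= min_attempts]
    # Sort by attempt count (descending), then by IP text for a stable order.
    offenders.sort(key=lambda item: (-item[1], item[0]))
    return [ip for ip, _ in offenders]


# Hypothetical counts using reserved documentation IP ranges.
counts = {"203.0.113.7": 4210, "198.51.100.23": 350, "192.0.2.10": 1}

blacklist = build_blacklist(counts, min_attempts=100)
for ip in blacklist:
    print(ip)
```

Note that a cutoff like this only catches the noisy, automated sources. The single-attempt addresses discussed above fall below it by design and would need a separate look.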

The Internet can be a dangerous place, especially for internet facing devices that aren’t protected by a router or other firewall enabled devices. Nothing is impossible to mitigate and understand for a system administrator that is armed with the correct data. The epistemological beauty of geography is the interdisciplinary applications that can be made with almost anything. Even something as insignificant as failed access attempts can be used to paint a data-rich picture.

Mapping the Construction of I-485

 

485-by-year-final

The geography of I-485’s construction begins in the south of the city. This immediately started providing relief for the increasing volume of traffic due to the growth of the suburbs in south Charlotte and near the South Carolina border. The next order of business was connecting the attractions, university, and high population suburbs in the northeast of the city. Finally, the west quadrant of the road was completed, alleviating traffic on Billy Graham Parkway around the airport and connecting the I-85 – I-77 bypass in the northwest.

I-485 broke ground in 1988 and was a completed beltway in 2015. It took 27 years to build 67.61 miles of the interstate at a rate of 2.504 miles per year. Compared to other beltways, this is a relatively lengthy period of construction.

I-270, a beltway around Columbus, Ohio, took 13 years to build, being completed in 1975. This equates to 4.228 miles per year of construction. This is partially due to the stimulus provided by the Federal-Aid Highway Act of 1956, championed by Dwight Eisenhower, which provided resources for state governments to jumpstart construction on the interstate system that we know today.

I-465, the beltway around Indianapolis, broke ground in 1959 and, drawing from the highway stimulus, its 52.79 miles were completed at a blistering 4.799 miles per year.
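The rates above are just length divided by years of construction. A quick Python check using only the figures quoted in this post; the I-270 length and the I-465 duration are not stated above, so the sketch derives the values implied by the quoted rates rather than asserting them as facts.

```python
def miles_per_year(miles, years):
    """Average construction pace of a beltway."""
    return miles / years


# I-485: 67.61 miles built between 1988 and 2015.
i485_rate = miles_per_year(67.61, 2015 - 1988)

# I-270: 13 years at the quoted 4.228 miles per year implies this length.
i270_implied_miles = 4.228 * 13

# I-465: 52.79 miles at the quoted 4.799 miles per year implies this duration.
i465_implied_years = 52.79 / 4.799

print(round(i485_rate, 3))
print(round(i270_implied_miles, 2))
print(round(i465_implied_years, 1))
```

Put side by side, I-485 was built at a little over half the pace of the two older beltways.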

Constructing this collage of maps in ArcMap provided exposure to some of the more intermediate functions of the design toolkit. After acquiring the interstate highway data from the Mecklenburg Open Mapping portal, 17 data frames were created to represent the 17 phases of I-485 construction according to the history section of the I-485 article, which cites The Charlotte Observer.

In the layout view, I navigated to the data frame tab in the properties menu to set the extent of each map to mimic the extent of the first data frame I set manually. This, I believe, was the optimal design choice to map the different phases of construction. It is a shortcut compared to manually adjusting each map’s extent, which would have taken considerably longer. It also ensures consistency with the design.

I enabled grids in the layout view which is nice for checking the alignment of elements at a glance. Also, I adjusted the page layout to allow a custom margin (20 inches by 8) when exporting the map as an image.

Finally, I made use of the distribute tool which is in the right click menu of the layout view. This easily allowed me to align all the rows automatically, eliminating the need to manually align each individual data frame. Each row was aligned horizontally and then vertically to ensure each was in the proper position. The same method was used for the text above each frame. Some instances of the text, however, needed to be adjusted manually.

This was a fun exercise. I could take it further in the future by color coding the map according to how many lanes each section has. This would allow the presentation of lane widening projects which are still ongoing on 485 as well as many other beltways and interstates around the country. It would be interesting to compile all the phases of construction into a short movie or .gif. This would require going back and upscaling each data frame individually to get a useable resolution.

Reflecting on the design, I’m not sure how to deal with the text labels. Looking at a glance it can be confusing which map a label is referring to, the map above or below. Perhaps I could have put the label in the middle of the beltway to clarify exactly what map is being labeled. This might have allowed each map to be bigger. This might detract from the negative space in the middle of the beltway and give a cluttered appearance. The use of lines might have been appropriate to border each map with its label. This might have made the map too busy. I’m happy with how it turned out. It’s always good to consider the alternatives.

Mapping the Electoral College, Reality vs. Hypothetical

How much does your vote actually matter? This year’s presidential election was an interesting affair to say the least. The votes haven’t been completely counted as of this writing but the winner of the popular vote and the winner of the electoral vote are likely not to be the same candidate.

The popular vote winner / electoral vote winner discrepancy isn’t unprecedented. In 2000 George W. Bush won the presidency despite Al Gore winning the popular vote. We’d have to go back to the 1800s to find the other two instances: Benjamin Harrison’s electoral victory over Grover Cleveland’s popular victory, and Rutherford B. Hayes beating Samuel Tilden, who was the winner of the popular vote. The latter was settled by the Compromise of 1877, which promised the removal of federal troops from the South, in an attempt to satisfy popular sentiment, in exchange for a Hayes presidency.

Hillary Clinton will likely become the 4th presidential candidate in American history to win the popular vote but lose the electoral vote. This has brought the role of the electoral college into question in many circles. What role does the electoral college play?

In a representative democracy like the United States, people elect officials to represent their interests. The 538 electors that make up the electoral college correspond to the 435 representatives of the House, the 100 senators representing the states, and 3 electors representing the people living in Washington D.C. Typically, the electors will vote in accordance with the popular vote but it’s interesting to note that no federal law binds them to do so. The college was a system that was originally implemented to assure that states with small populations would have a fair say in the elections. Article II, Section 1, Clause 2 of the Constitution is the origin of the electoral college’s use in elections.

Let’s look at how the electoral college represents the population.

electoral-college-2016-final

Higher population, more electoral votes. Electoral votes are reapportioned among the states after each census to reflect changing populations. Let’s take a look at population and see how it compares.

population-2015

At a glance everything looks fine. Colors are similar and correspond between the two maps. Let’s compare electoral votes and population mathematically. By dividing the population by the electoral votes we can see how much of a state’s population is represented by 1 electoral vote.

electoral-weight

Lighter colored states have lower populations per electoral vote meaning someone’s personal vote is worth more in a light-colored state than a dark-colored state. For example, voting in Wyoming, the lowest population per electoral vote, will give your vote 3.62 times more electoral weight than a vote in California, the highest population per electoral vote. This seems strange when first considering the differential. Let’s take a look at voter turnout in the 2016 election.
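The weight comparison can be reproduced with a few lines of Python. The populations below are rounded approximations of 2015 estimates that I am assuming for illustration, so the ratio will land near, but not exactly on, the figure from my map data.

```python
def people_per_elector(population, electoral_votes):
    """How many residents are represented by one electoral vote."""
    return population / electoral_votes


# Approximate, rounded 2015 populations (assumed for illustration)
# with each state's 2016 electoral votes.
wyoming = people_per_elector(586_000, 3)
california = people_per_elector(39_145_000, 55)

# How many times more electoral weight a Wyoming vote carries.
weight_ratio = california / wyoming

print(round(wyoming), round(california), round(weight_ratio, 2))
```

The same function applied to every state produces the values behind the map above.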

voter-turnout-2016

90 million voters out of the estimated 231 million that are eligible to vote didn’t vote in the 2016 general election. According to statisticbrain.com 44.4% of people didn’t exercise their right to vote, one of the most critical rights in a democratic society. In the above map we can see some interesting correlation. California’s turnout is the lowest after Hawaii. Is it fair that California would receive population-based electoral votes considering the amount of voter apathy? Should Florida receive the same number of electoral votes as New York despite having a notably higher voter turnout? Should voter turnout even matter at all when considering the allocation of electoral votes? Does it play a role in reality?

Let’s adjust the electoral vote per population by the percent of voter turnout.

population turnout adjusted.png

Nothing significantly different. The Northwest voting bloc is relatively stronger. The Rust Belt as a region sees an increase in voting influence per person on the electoral college. The California-Wyoming comparison made earlier has seen its ratio drop to 2.71 compared to 3.62, meaning a vote in Wyoming still carries 2.71 times the electoral influence of a vote in California.

Let’s see what the electoral college would look like if it were adjusted to reflect these numbers. To do this, we take the total voting population of each state and redistribute the 538 electors among them, excluding D.C.
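The redistribution is a straight proportional split, which is why the results come out as fractions. Here is a small Python sketch with three fictional states and made-up voter totals. It assumes 535 electors are shared among the states with D.C.’s 3 set aside, which is one reading of the approach described above.

```python
def redistribute(voters, total_electors):
    """Split electors among states in proportion to votes cast (fractions allowed)."""
    total_voters = sum(voters.values())
    return {
        state: total_electors * count / total_voters
        for state, count in voters.items()
    }


# Fictional states with hypothetical numbers of votes cast.
voters = {"State A": 6_000_000, "State B": 3_000_000, "State C": 1_000_000}

shares = redistribute(voters, 535)
for state, electors in shares.items():
    print(state, electors)
```

Rounding these fractions to whole electors while keeping the total fixed is a separate apportionment problem that this sketch does not attempt.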

electoral-college-redrawn

Of course in reality the number of electors has to be a whole number. You can’t have .5 of an elector. A few things to consider: Florida’s electoral power has significantly increased. California’s has decreased. The Great Plains states have had their electoral influence lowered. The Rust Belt has seen an increase across the board. Of course in this scenario, changing the number of electoral votes based on voter turnout might encourage or discourage people to show up as their state’s electoral influence waxes or wanes.

Perhaps this constant evolution of the electoral college would be a viable solution. As a representative democracy, citizens should expect accurate representation every time they fill out a ballot. Maybe this kind of feedback on voter turnout is excessive and people may feel punished by the apathy of voters they may share a state with. It’s also important to consider that, in reality, each state has 2 senators and at least 1 representative of the house, making the lowest possible electoral votes for a state 3.

If the election were held with this electoral college, including D.C.’s 3 votes, Trump would have beaten Clinton 312.9 to 225.1. In reality, if Michigan goes in favor of Trump, the count will be 306 to 232 which is surprisingly similar. It’s not hard to imagine some federal statisticians crunching numbers and tweaking the electoral college in a back room somewhere in Washington. It’s interesting to consider that if we remove the Rust Belt states of Pennsylvania, Ohio, Michigan, and Wisconsin, the count would be 222.6 to 222.2, showing just how much of a role the Rust Belt plays in this scenario as well as in deciding the election in reality.

Nobody can predict the future course democracy will take in our country. Perhaps the representation of the electoral college will change, be replaced, or continue on in its current state. As of now, it plays a key role in directing the will of the nation through the representatives that champion the American democratic process.

They say the devil is in the details. And the details of this democracy are definitely geographic.

Mapping American Marijuana Laws 2016

After one of the most outrageous election cycles in American history, the geography of American marijuana laws quietly changed. 3 states joined Washington, Washington D.C., Alaska, Oregon, and Colorado with successful ballot measures legalizing the recreational use of marijuana. Florida also joins the ranks of states where psychoactive variations of THC can be used for medical purposes.

marijuana-legality
Marijuana legality as of November 10th 2016 in the United States

The above map shows the updated status of marijuana legality across the nation, reflecting the successful ballot initiatives of the 2016 election cycle.

The geography is interesting to consider. The west coast is the geographic bastion of legal recreational marijuana use. This isn’t hard to believe considering the historically liberal attitudes these states have held when legislating the plant. In the mid 1990s the effort to legalize medical marijuana originated in California and spread in a similar fashion.

Massachusetts, nestled in the similarly minded New England region of the northeastern United States, becomes the first east coast state to legalize marijuana recreationally. Considering the trend of medical marijuana in the region, it is likely to become similar to the west coast, with legality slowly permeating through the ballots of all the northeastern states.

The southeast, more conservative with this type of legislation, allows nonpsychoactive treatment of a limited number of medical conditions. This includes THC in dropper or pill form but not in its plant form. Florida marks the first step in introducing legislation to the region in the form of psychoactive medical treatment. This is the plant form of marijuana that contains THC that many people may be familiar with.

Washington D.C.’s previous decision to legalize recreational use is interesting considering its geography. It’s nestled at the crossroads of the southeast and New England, two regions which have different legislative opinions regarding the plant. D.C. is unique in that recreational use is legal but you can’t legally buy it anywhere in the district. Marijuana dispensaries are not allowed to operate within its borders.

In the middle of the country in the breadbasket of the Great Plains, marijuana remains illegal in all forms, medical or recreational. As of today, 7 states remain where marijuana is completely illegal. If marijuana continues to be a state-level decision, these states will likely be the last to legalize medically and recreationally, if they ever decide to.

Mapping Dropout Rates in Charlotte, NC

“Education is the most powerful weapon which you can use to change the world.”

-Nelson Mandela

This project looked into the dropout rates in the city of Charlotte over 4 different years: 2004, 2008, 2010, and 2012.

I used ArcMap to map the dropout rates that were reported in the Quality of Life reports the city of Charlotte publishes yearly.

Quality of Life reports

These pdf reports are deprecated after the release of the new GIS applet to report this data.

Quality of Life Explorer

Methodology

I created a spreadsheet to curate the data of several of the reports.

I then compared the data to create the values of change. Where no values were present, I left the value as null. If there was no data in 2008 but data in both 2004 and 2010, I compared 2004 directly to 2010.
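The comparison rule is simple enough to sketch in Python. I did this in the spreadsheet, so the function names and the sample rates here are hypothetical and only illustrate the logic of leaving gaps as null and falling back to 2004 when 2008 is missing.

```python
def dropout_change(earlier, later):
    """Change in dropout rate between two years, or None if either is missing."""
    if earlier is None or later is None:
        return None
    return later - earlier


def change_to_2010(rate_2004, rate_2008, rate_2010):
    """Change into 2010, compared against 2008 or, if 2008 is missing, 2004."""
    baseline = rate_2008 if rate_2008 is not None else rate_2004
    return dropout_change(baseline, rate_2010)


# Hypothetical neighborhood dropout rates in percent.
print(change_to_2010(4.0, 5.5, 3.5))    # all three years present
print(change_to_2010(4.0, None, 3.5))   # 2008 missing, falls back to 2004
print(change_to_2010(None, None, 3.5))  # nothing to compare against
```

Using None for missing values keeps empty neighborhoods out of the change maps rather than showing them as a change of zero.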

For 2012 I used the included spreadsheet data. The shapefiles were different because the Quality of Life organization changed how they collected data, making direct comparison with the software difficult.

neighborhoods

I used the following scheme from colorbrewer for the data maps and used an included ArcMap scheme for the change maps.

color brewer.PNG

I included a map with all 4 years visible for easy comparison.

charlotte dropout map.png

To classify the data I joined the Excel spreadsheet with the neighborhood shapefile, using the NSA neighborhood identifier.

classify-the-data

I used the following 6-class manual classification across the 4 maps.

classify

I used the following 9-class manual classification to map the change maps.

classify

This project gave me exposure to the manual input of data, which is mechanical and boring but I find intrinsically rewarding for some reason. I had to manually enter the data from the Quality of Life pdfs into a spreadsheet, which was time intensive, taking about an hour per report. In the future if I’m ever parsing data in this format, I’ll use an autoscrolling feature to automatically scroll the reports while entering the data at the same time. This, in theory, would take half as long to enter the data. This exposure to data entry opens the door to other presentations of data through ArcMap and other data manipulation applications in the future.