Mapping YouTube Views


YouTube has been an entertainment phenomenon ever since it arrived on the internet in 2005. Its reach is staggering, bringing videos to every corner of the Earth. In every country of the world the word YouTube is synonymous with online entertainment. I’ve always been fascinated by the maps YouTube provided in the “statistics” section of the videos. Every country in the world would be represented on the most popular videos. It’s a shame YouTube has removed these statistics from public view. Now it’s only possible to see these stats if the uploader makes them available.

YouTube analytics

YouTube has a great analytics platform for content creators. It has an interactive map built into the creator studio which is great for geographic analysis. There are ways to export this data using the API tools YouTube provides. I thought it would be fun to take this data and create a couple of maps of my own. Instead of using the API I acquired the data the old-fashioned way: copying and pasting.

I decided to make a map of every country except the United States. Since 95% of my views come from the United States, some methods of separating the data would make other countries almost indistinguishable on a map.

After copying and pasting the lifetime statistics from the interactive map portion of the YouTube analytics page, I added them to an Excel spreadsheet and created a .csv document to add to ArcMap. There was limited parsing to be done since all the data was already organized. I removed the fields I wasn’t going to be using, like watch time, average view duration, and average percentage viewed. In the future it might be interesting to map these variables but today I’m just going to focus on raw view numbers.
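
The trimming step can also be scripted instead of done by hand in a spreadsheet. This is only a minimal sketch: the column headers `Geography` and `Views` are assumptions for illustration (the real headers in the pasted table may differ), and the sample rows are made up.

```python
import csv
import io

def trim_views(csv_text, keep=("Geography", "Views"), exclude=("United States",)):
    """Keep only the chosen columns and drop the excluded countries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        if row[keep[0]] in exclude:
            continue  # leave out countries that would swamp the color scale
        rows.append({name: row[name] for name in keep})
    return rows

# Made-up sample data, for illustration only.
sample = """Geography,Views,Watch time (minutes),Average view duration
United States,100,250,2:30
Canada,12,30,2:10
Germany,7,15,1:58
"""

for row in trim_views(sample):
    print(row["Geography"], row["Views"])
```

The result can be written back out with `csv.DictWriter` to produce the .csv that gets added to ArcMap.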

I’m using the template that I used for my WordPress map. It uses a basemap and a borders shapefile from thematicmapping. This easily allows me to join the csv to the shapefile table and we’re quickly off to the cartographic races.

Compared to the WordPress site, my YouTube channel has a much more impressive geographic reach. Out of the 196 countries on Earth, 134 of them have clicked on a video I host on my channel. This is great because it means I’m over halfway to completing my collection of all countries.

The map includes all of the countries except the United States, which has over 11,000 views. I decided to use 10 natural breaks in the colors to add more variation to the map. Experts say that the average human eye can’t differentiate more than 7 colors on a map, so using 10 classes here is purely a design choice.

YoutubeViews_sansUSA

It looks like I have to carry some business cards with me next time I go to Africa. It’s nice to see such a global reach. It feels good to know that, even for a brief second, my videos were part of someone’s life in the greater global community.

Mapping WordPress Views

It’s been a year since I started writing this blog. Time, as always, seems to fly by. Blogging here has allowed me to develop my writing, communication, and research skills. I thought I’d do something WordPress related to celebrate a year of success and hopefully many more to come. I thought of a quick and easy project to map the geographic locations of visitors to this blog over the last year. It’s always interesting to see what countries people are visiting from and I’m always surprised at the variety.

Data acquisition is simple for this project. WordPress makes statistics available, so it’s not difficult to acquire them or parse the data since the provided data is pretty solid. The one thing that needs to be done is combining the 2016 and 2017 data into one set since WordPress automatically categorizes visitation statistics by year. Since this blog has only been active for 2016 and 2017, there are only two datasets to combine. This is easily done using a spreadsheet and by having the WordPress statistics available.
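
Combining the two yearly tables amounts to summing the counts per country. A small sketch of that idea follows; the numbers are placeholders, not the blog’s real statistics, and the function name is just for illustration.

```python
from collections import Counter

def combine_years(*yearly_views):
    """Sum per-country view counts across any number of yearly tables."""
    total = Counter()
    for views in yearly_views:
        total.update(views)  # adds counts; new countries are created as needed
    return dict(total)

# Placeholder numbers, not the blog's real statistics.
views_2016 = {"United States": 40, "Canada": 5}
views_2017 = {"United States": 55, "Canada": 3, "Brazil": 2}

combined = combine_years(views_2016, views_2017)
print(combined)
```

A country that appears in only one year is carried over unchanged, which is what the combined map needs.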

The data suggests growth, with 2017 already overtaking the entirety of 2016 in terms of views. It’s also interesting that 2017 is more geographically diverse, consisting of 49 unique countries compared to 31 in 2016. I decided it would be appropriate to create 3 maps: one for 2016, one for 2017, and one combining the two. This would allow one to interpret the differences between the years and see the geographic implications as a whole.

I began by exporting the data into a CSV file to be read by ArcMap. I decided on the blank world map boundaries from thematicmapping.org for a basemap. The previously prepared CSV was then attached to the basemap via the “name” entry, which reconciles both data tables with the name of each country. Once the data is on the map it’s over to the quantities symbology to adjust the color scheme and create a choropleth map. I chose to break the data 7 ways and to remove the borders from the countries to give it a more natural, pastel look.
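
A join on country names only works where both tables spell the name the same way, so it can help to check for mismatches before joining. The sketch below is not ArcMap’s join, just the same idea in plain Python; the names and counts are hypothetical.

```python
def join_by_name(shape_names, views):
    """Attach view counts to shapefile country names and report what fails to match."""
    joined = {name: views[name] for name in shape_names if name in views}
    unmatched = sorted(set(views) - set(shape_names))
    return joined, unmatched

# Hypothetical names and counts for illustration.
shape_names = ["United States", "Canada", "Korea, Republic of"]
views = {"United States": 95, "Canada": 8, "South Korea": 4}

joined, unmatched = join_by_name(shape_names, views)
print(joined)     # rows that would receive a value on the map
print(unmatched)  # CSV rows that need their name corrected first
```

Any name in the unmatched list would be silently left off the map by a name-based join, so those rows are worth fixing in the CSV beforehand.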

In layout view the design touches are added. A title was placed at the top and the map was signed. The legend was added and I used one of the tricks I’ve found useful to format it. First I add the legend with all the default settings and get the positioning correct. After it’s in position I double check that the data components are correct. Then “Convert to Graphics” is selected to turn the legend into an editable collection of graphic elements. The only downside to this is that the legend no longer reflects changes in the data, so making sure the data is correct before converting is critical. After it’s been converted, selecting “Ungroup” will separate each of the graphical elements, allowing the designer to manipulate each individually. I personally find this easier and more intuitive to work with. After editing, the elements can be regrouped and other elements like frames and drop shadows can be added.

Wordpress2016

Full Resolution

Making the 2017 map followed the same methodology.

Wordpress2017

Full Resolution

Combining the two datasets was the only methodological variation when making the final map.

WordpressAll

Full Resolution

At a glance, the trends seem typical. North America is represented in the data, as is Europe. There is an unexpected representation in Asia which might be due to the several articles that have been written about China. It’s also neat seeing visitors from South America. The rarest country is New Caledonia, a French territory in the Pacific about 1000 miles off the eastern coast of Australia.

In the future it would be interesting to create a map that normalizes the number of visitors according to the population of the countries. This would create a map that shows which countries visit at a higher or lower rate per capita. This would illustrate which countries are more drawn to the content on the site.
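
The normalization itself is a single division per country. Here is a rough sketch, with placeholder country names and figures chosen only to make the arithmetic easy to follow; a real version would need a population table from an outside source.

```python
def visits_per_million(visits, population):
    """Return visits per one million residents for countries with a known population."""
    rates = {}
    for country, count in visits.items():
        people = population.get(country)
        if people:  # skip countries with a missing or zero population
            rates[country] = count * 1_000_000 / people
    return rates

# Placeholder figures, not real visitor or population data.
visits = {"Country A": 50, "Country B": 5, "Country C": 9}
population = {"Country A": 100_000_000, "Country B": 250_000}

rates = visits_per_million(visits, population)
print(rates)
```

In this made-up example the small country has far fewer visits but a much higher rate per capita, which is exactly the kind of difference a normalized map would bring out.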

Here’s to hoping for more geographical variation in the future. Maybe one day all countries will have visited Thoughtworks.

Mapping Malicious Access Attempts

Data provides an illuminating light in the dark in the world of network security. When considering computer forensics assessments, the more data available, the better. The difference between being clueless and having a handle on a situation may depend on one critical datapoint that an administrator may or may not have. When data metrics that accompany malicious activity are missing, performing proper forensics of the situation becomes exponentially more difficult.

Operating a media server in the cloud has taught me a lot about the use and operation of internet facing devices. The server is provided by a third party that leases servers in a data center, and it runs Lubuntu, a distribution of Linux. While I’m not in direct control of the network this server is operating on, I do have a lot of leeway in what data can be collected since it is “internet facing”, meaning it connects directly to the WAN and can be interacted with as if it were a standalone server.

If you’ve ever managed an internet facing service you’ll be immediately familiar with the number of attacks targeted at your machine, seemingly out of the blue. These aren’t always manual attempts to gain access or disrupt services. They are normally automated and persistent: someone only has to designate a target, and the botnets and other malicious actors tasked with the heavy lifting begin an attack that is capable of operating on its own, persistently, without human interaction.

While learning to operate the server, I found myself face to face with a number of malicious attacks directed at my IP address, seeking to brute force the root password in order to establish an SSH connection on the server. A successful attempt would essentially give an attacker complete control of the server, and a strong password is the only thing standing between the vicious world of the internet and the controlled environment of the server. The log of these attempts provided a list of IP addresses and, like any good geographer, I was eager to put the data on a map to spatially analyze what part of the world these attacks were coming from, and to glean some information on who was targeting my media server and why, given that it is an entity with little to no tangible value beyond the equipment itself.

Screenshot_20170527-000900

This log of unauthorized access attempts can be found in many mainstream Linux distributions in the /var/log/auth.log file. By using the following bash command in the terminal it is possible to count how many malicious attempts were made by each unique IP and rank them by count.

grep "Failed password for" /var/log/auth.log | grep -Po "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" | sort | uniq -c | sort -rn


Parsing operations like this allow system administrators to quickly see which IP addresses failed to authenticate and how many times they failed to do so. This is one of the steps that turn raw data into actionable knowledge. By turning this raw data into interpretable data we are actively transforming its interpretability and, as a result, its usability.
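
The same tally can be produced in Python, which is handy if the counts are headed for a spreadsheet anyway. This is a sketch rather than a drop-in replacement for the shell command: it takes only the first IP-like string on each matching line, and the sample log lines are made up using reserved documentation addresses.

```python
import re
from collections import Counter

# Same loose pattern as the shell command: four dot-separated runs of digits.
IP_PATTERN = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")

def count_failed_logins(lines):
    """Count failed password attempts per source IP, most frequent first."""
    counts = Counter()
    for line in lines:
        if "Failed password for" in line:
            match = IP_PATTERN.search(line)
            if match:
                counts[match.group()] += 1
    return counts.most_common()

# Made-up log lines for illustration.
sample_log = [
    "May 15 10:00:01 host sshd[101]: Failed password for root from 203.0.113.5 port 4242 ssh2",
    "May 15 10:00:03 host sshd[102]: Failed password for root from 203.0.113.5 port 4243 ssh2",
    "May 15 10:00:09 host sshd[103]: Failed password for invalid user admin from 198.51.100.7 port 5151 ssh2",
    "May 15 10:01:00 host sshd[104]: Accepted password for me from 192.0.2.10 port 6000 ssh2",
]

for ip, attempts in count_failed_logins(sample_log):
    print(attempts, ip)
```

To run it against the real log, pass an open file object for /var/log/auth.log in place of the sample list; reading that file usually requires elevated privileges.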

This list is easily exported to an Excel spreadsheet where the IPs can be georeferenced using other sources like abuseipdb.com. Using this service I was able to link each IP address and its number of access attempts to the geographic location associated with it at the municipal, state, and national level.

After assigning each IP address a count and a geographic location I was ready to put the data on a map. Looking over the Excel spreadsheet showed some obvious trends out of the gate: China seems to account for the majority of the access attempts. I decided to create 3 maps. The first would be based on the city each attack originated from, with a graduated symbology around each city expressing the number of attacks that originated from that data point. This would allow me to see at a glance where the majority of the attacks originated.

The first map was going to be tricky. Since the georeferencing built into ArcMap requires a subscription to the ArcGIS Online service, I decided to prepare my own data. I grouped all the entries and consolidated them by city. Then I went through and manually entered the coordinates for each one. This is something I’d like to find an easier solution for in the future. When working with coordinates, it’s also important to use matching coordinate systems for all features in ArcMap to avoid geographic inaccuracies.
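
The grouping step can be sketched in a few lines. The record layout here (IP, city, country, attempts) is an assumption about how the spreadsheet rows were arranged, and the values are placeholders.

```python
from collections import defaultdict

def consolidate_by_city(records):
    """Sum attempt counts for records that share a (city, country) pair."""
    totals = defaultdict(int)
    for _ip, city, country, attempts in records:
        totals[(city, country)] += attempts
    return dict(totals)

# Placeholder rows: (ip, city, country, attempts).
records = [
    ("203.0.113.5", "City A", "Country X", 120),
    ("203.0.113.9", "City A", "Country X", 30),
    ("198.51.100.7", "City B", "Country Y", 4),
]

city_totals = consolidate_by_city(records)
for (city, country), attempts in city_totals.items():
    print(city, country, attempts)
```

Keying on the city and country together avoids merging two different cities that happen to share a name. Each resulting row still needs a latitude and longitude before it can be plotted as a point.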

map2b

Full resolution – http://i.imgur.com/sY0c7IJ.jpg

Something I’d like to get better at is reconciling the graduated symbology between the editing frame and the data frame. Sometimes size inaccuracies can throw off the visualization of the data. This is important to consider when working with graduated symbology, like in this case, where the larger symbols are limited to 100 pts.

The second map included just countries of origination, disregarding the cities metric. This choropleth map was quick to create, requiring just a few tweaks in the spreadsheet. This would provide a quick and concise visualization of the geographic national origins of these attacks in a visually interpretable format. This would be appropriate where just including cities in the metric would be too noisy for the reader.

The following is a graphical representation of the unauthorized access attempts on a media server hosted in the cloud, with the IPs resolved to the country of origin. Of the roughly 53,000 access attempts between May 15 and May 17, over 50,000 originated from China.
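
To put those figures in proportion, the share per country is each count divided by the total. The sketch below uses only the rounded numbers quoted above, so the output is approximate; the real per-country counts live in the spreadsheet.

```python
def country_shares(counts):
    """Return each country's share of all attempts, as a percentage."""
    total = sum(counts.values())
    return {country: 100 * n / total for country, n in counts.items()}

# Rounded figures from the text: roughly 53,000 attempts, over 50,000 from China.
counts = {"China": 50_000, "All other countries": 3_000}

shares = country_shares(counts)
for country, share in shares.items():
    print(f"{country}: {share:.1f}%")
```

With these rounded inputs China accounts for about 94% of the attempts, which is why a single country dominates the color scale on the choropleth map.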

To represent this choropleth map I saved the data into a .csv file and imported it into ArcMap. Then came the georeferencing. This was easily done with a join operation with a basemap that lists all the countries. The blank map shapefile was added twice: once for the join and once for the background. During the join operation I removed all the countries I didn’t have a count for. Then I sent this layer to the top so all the colorless empty countries would appear behind the countries with data. This is one thing I continue to love and be fascinated with about ArcMap: the number of ways to accomplish a task. You could use a different methodology for every task and find a new approach each time.

map3

Full resolution – http://i.imgur.com/XyqOexM.png

I decided the last map should show the provinces of China to better represent where attacks were coming from in this area of the world. The data was already assembled so I sorted the Excel spreadsheet by the country column and created a new sheet with just the Chinese entries. I was able to refer to the GIS database at Harvard which I wrote about in an earlier article concerning the ChinaX MOOC they offered. This was reassuring considering my familiarity with the source. The Excel spreadsheet was then consolidated and a quick join operation to the newly downloaded shapefile was all it took to display the data. A choropleth map would be appropriate for this presentation. I had to double check all the province names to make sure no major provincial changes had been missed by the dataset, considering the shapefile was from 1997.

map4

Full resolution – http://i.imgur.com/ZhJpHLM.png

While the data might suggest that most of the threats originate from China, the entities with a low number of connections might be the most dangerous. If someone attempts to connect just 1 time, they might have a password that they retrieved by means of a Trojan horse or a password leak. These are the entities that may be worth investigating. All these entries were listed in the abuseipdb database so they all had malicious associations. While these threats aren’t persistent in the way the automated ones are, they might suggest an advanced threat or threat actor.

Some of the data retrieval might be geographically inaccurate. While georeferencing IP addresses has come a long way, it’s still not an entirely empirical solution. Some extra effort might be required to make sure the data is as accurate as possible.

How does this data help? I can turn around and take the most incessant threats and blacklist them on the firewall so they’ll be unable to even attempt to log in. Using this methodology I can begin to create a blacklist of malicious IPs that I can continue building upon in the future. This allows me to geographically create a network of IPs that might be associated with a malicious entity.
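
A blacklist like this can be generated straight from the attempt counts. The following is a sketch under a few assumptions: the threshold is arbitrary, the addresses are reserved documentation IPs, and the printed `iptables` rule is just one common way to drop traffic from an address, not necessarily the firewall used on this server.

```python
import ipaddress

def build_blacklist(attempt_counts, threshold):
    """Return valid IPs whose failed-attempt count meets the threshold, sorted by address."""
    blocked = []
    for ip, count in attempt_counts.items():
        if count < threshold:
            continue
        try:
            blocked.append(ipaddress.ip_address(ip))
        except ValueError:
            continue  # skip strings that are not valid IP addresses
    # Sorting assumes one address family (IPv4 here); mixing IPv4 and IPv6 would raise TypeError.
    return [str(ip) for ip in sorted(blocked)]

# Made-up counts for illustration.
attempt_counts = {
    "203.0.113.5": 1200,
    "198.51.100.7": 3,
    "192.0.2.10": 450,
    "999.1.1.1": 5000,  # matches a loose digit pattern but is not a real address
}

blacklist = build_blacklist(attempt_counts, threshold=100)
for ip in blacklist:
    print(f"iptables -A INPUT -s {ip} -j DROP")
```

Validating each entry matters because a loose pattern match on the log can pick up strings that look like addresses but are not. Low-count entries are deliberately left out here, although they may still deserve a manual look.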

The Internet can be a dangerous place, especially for internet facing devices that aren’t protected by a router or other firewall enabled devices. Nothing is impossible to mitigate and understand for a system administrator that is armed with the correct data. The epistemological beauty of geography is the interdisciplinary applications that can be made with almost anything. Even something as insignificant as failed access attempts can be used to paint a data-rich picture.

Comparing Genetic Results from Ancestry.com, 23andMe, and the Genographic Project

Genetics has always been an interesting subject to me. Genes and the DNA that carries them represent something that can be traced back in time and will be around long after the individual carrying it has passed. These unique identifiers contribute extensively to the lives they create and are the invisible building blocks that make us objectively human. The fact that you can derive geographic and cultural information from genetic makeup is a fascinating contribution to the story of mankind. Certain mutations and DNA markers are geographically unique and allow geneticists and human geographers to pinpoint where, and when, on Earth these genetic differentiations occurred. It’s amazing to consider that where oral tradition, once believed to be the most effective form of relaying information between generations, has failed, science has been able to pick up the reins and accurately surmise information that would likely have been the core component of what was passed between these generations. One’s culture, ancestors, and origin stories were a large part of ancient, familial traditions and still tug at the curiosities of modern humans, as demonstrated by the millions of people who pay to have their DNA tested by the many services that reach into the past and attempt to rekindle the ancient stories locked within the human genome.

My genetic journey began in 2014 when I became curious about my genetic origins. I knew what made me myself physically, psychologically, and culturally but I wanted to know what kind of influence my ancestry had on the many facets of my being, which of my many eccentricities had been shared, experienced, and influenced by those who came before me, and the characteristics that could theoretically be passed along after me. I wanted to know who I was at the core of my physical being. If you stripped away the cultural, environmental, temporal and geographic factors, I wanted to know what would be left and this is what I was looking for philosophically when I began looking back. Epistemologically, I’ve always enjoyed history and the personal element presented by investigating one’s own personal history created a unique and curious opportunity to consider both the history of the world as a whole and how it intertwined with the history of my ancestors. The “big picture” is comprised of many small pictures and I found myself becoming curious and motivated to discover how the small pictures in my past fit into the big picture of humankind.

I began by looking back into the past of both my paternal and maternal lineages, using knowledge I had gained, first-hand accounts from my elders, and the ever expanding resource that is the Internet. I quickly found myself with thousands of entries in my family tree, each entry a unique mystery that was fulfilling to resolve and connected intricately to the mysteries before and after it. Family trees based on recorded history can only take you so far. DNA analysis picks up where that story leaves off.

Ancestry.com

Ancestry.com is a record curating service that allows users to create family trees and cite the entries using the ever-growing collection of historical records. In 2012, Ancestry.com began offering DNA testing which would allow customers to look in their past in a new way.

In 2014 I decided to try the Ancestry DNA testing kit. At the time I was unsure of my ancestry. The only information in my family lore was limited to Scottish and Irish on my maternal side and almost nothing on my paternal side besides my English surname. The exact story had been lost to time after 8 generations in the New World.

After 6 weeks, I got my results:

original
My Ancestry.com results

Going in I had no expectations so I wasn’t particularly surprised by any of the results. The majority Scandinavian result was interesting considering there was no story of it in my extended family lore. I figure these results put a lot of weight on haplogroups considering my paternal haplogroup is Scandinavian, which I found out in the 23andMe testing later.

I went on to have the rest of my family tested to see how the results stack up. I tested my mother, my father, and my maternal grandmother. My reasoning was scientific: the results could be compared against each other and the accuracy of the test could be assessed. It might make a difference that I was working “backwards”, submitting myself first, then my parents, then my grandparents. Typically you work your way “down” the genetic line when considering the components and relation of someone’s DNA. It’s not impossible to do but I wonder how the results would have changed if I had my grandmother tested, then my parents, then myself.

Paternal_ancestry
Paternal DNA
maternal
Maternal DNA

Looking at the results, some questions are raised and the general gist of which ethnicities came from which parent is established. My original question was how my father can have 58% DNA from Great Britain while I have only 11%. As I understood it, I should have at least half, especially considering my mother had 17%. I’m not a geneticist so my conjecture is likely not accurate. It’s also important to remember that these numbers are weighted on certain genetic markers and not necessarily indicative of pure geographic fact. These numbers are compared to native populations that still live in the area today. Looking at the results it’s also clear that practically all of my Irish and Scandinavian influence comes from my mother, despite my paternal haplogroup being Nordic. It was also interesting to see the amount of Iberian Peninsula markers in my mother’s DNA, something that has never been explained in family lore. My father’s Finland/Northwest Russian markers were interesting and likely correlate with the European Jewish marker.

materal_grandmother
Maternal Grandmother DNA

We move farther up the genetic ladder with my maternal grandmother’s analysis. It is plain to see where the Irish influence comes from. The reduced Scandinavian numbers suggest that my maternal grandfather brought some Nordic genes to the mix. Again, there is the Iberian Peninsula influence which continues to be a mystery. This might be isolated to my maternal grandmother’s side, which gives me an indication of where to start looking for this influence. It seems I share the Middle East marker with my grandmother, showing that my mother could have been a carrier for this marker although it did not express itself in her results. It is possible it might have been expressed as the Caucasus marker.

Each of the three tests offers supplementary details alongside the DNA test. What I like about the Ancestry.com test in particular is the ability to export your raw DNA data in a plaintext file. This makes it possible to submit this text file for additional testing, use it to look for markers in smaller curation groups, and file it away for safekeeping if ever needed.
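
Because the export is plain text, it is easy to read with a short script. The sketch below assumes a tab-separated layout with `#` comment lines at the top followed by the columns `rsid`, `chromosome`, `position`, `allele1`, and `allele2`; check the header of your own file before relying on it. The marker IDs and values in the sample are invented.

```python
import csv

def load_genotypes(text):
    """Map marker ID -> (allele1, allele2), skipping blank and '#' comment lines."""
    data_lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row["rsid"]: (row["allele1"], row["allele2"]) for row in reader}

# Invented sample in the assumed layout, for illustration only.
sample = "\n".join([
    "# comment lines at the top of the file start with '#'",
    "rsid\tchromosome\tposition\tallele1\tallele2",
    "rs0000001\t1\t1000\tA\tG",
    "rs0000002\t2\t2000\tC\tC",
])

genotypes = load_genotypes(sample)
print(len(genotypes), "markers loaded")
print(genotypes.get("rs0000001"))
```

A lookup table like this is enough to check whether a specific marker is present in the file before submitting it anywhere else.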

Genetic_Communities
Genetic Communities

Another component of the Ancestry.com test that has been introduced rather recently is the Genetic Communities analysis. This uses your data alongside thousands of other users’ data and records to build a profile of the “when and where” all these users might have in common. This feature has been rolled out since I took my test, which suggests that Ancestry is continuously introducing new features that can assist its customers’ research inside or outside the community.

These particular results corroborate my research, adding another source I can use when retelling my story. It also serves as kind of an intermediary between the ethnicity estimate, which takes into consideration factors that are thousands of years old, and the personal research, which goes back several generations. If your immediate family history is limited, this will provide some good food for thought when moving forward with research.

DNA Matches
DNA Matches

The “DNA Matches” component of the results lets you see who shares your DNA. This service will automatically guess the degree of the relationship (1st, 2nd, 3rd cousin, etc.) and allow you to communicate with other members if you are so inclined. The results are numerous and are continually updated as more people join the service and the database is expanded. It has functionality for adding people directly to your family tree which streamlines the process. It directly integrates with a feature that was introduced in 2015:

DNA Circles
DNA Circles

DNA Circles automatically looks through the family tree information of those who are confirmed to be related to you through DNA testing and looks for similarities. It then automatically alerts you to “circles” you may be a part of. This takes the grunt work out of manually looking through your relatives’ trees for names you recognize. It also adds another layer of corroboration that can support your independent research.

I was happy with the Ancestry.com test, happy enough to do it 4 times. Each individual test is $99 and often goes on sale for holidays like Mother’s Day, Father’s Day and Christmas. My curiosity wasn’t satiated, though, and led me to a product that had been around since 2006.

23andMe.com

23andMe differs from Ancestry.com in that its focus is medical while Ancestry.com is focused on, well, ancestry. The 23andMe test includes a traditional ethnicity test that uses a unique database and marker system, likely producing different results than the Ancestry.com test, allowing for a “second opinion” of some sort. I was personally curious to see how my results would differ using the 23andMe system compared to the Ancestry system. It is notably more expensive than Ancestry’s $99 test at a whopping $199 for the full feature test which includes all of the medical tests and results. 23andMe provides just the ethnicity testing for $99.

23andMe

The 23andMe ancestry reports add some depth compared to the Ancestry.com reports. In addition to the ancestry composition which is a percentage breakdown of your ethnic makeup, it offers a look into your paternal and maternal haplogroups as well as the amount of Neanderthal markers that are present in your DNA. It also provides the DNA matches you have with people who use the service and offers a social platform to communicate with your relatives.

imageedit_5_6293939080

The ancestry composition is interesting. All of the reports on the site give you a visual representation of your DNA. When comparing matches you’re able to see exactly which components of your DNA are shared and, as pictured above, you’re able to see which chromosome segments are associated with which portion of your ancestral composition.

I was interested to see the British and Irish portion of my results at the top. This corroborates my paternal Ancestry.com test which was overwhelmingly British. The French and German levels (Western Europe on Ancestry) were similar, adding another layer of confirmation to the results. Scandinavian, however, which made up the majority of my reported ancestry on Ancestry.com, was relegated to 0.5%, suggesting there is a dramatic difference in the testing methodologies between the two sites. The East Asian results might be responsible for the slim Native American results on the Ancestry.com test. It is said that Genghis Khan has nine million descendants alive today. I might be one of them considering the Mongolian portion of the results. It’s also interesting to note the <0.1% of Iberian in the results. This is a major departure from what was suggested by Ancestry.com. The 23andMe testing methodology was not confident in the presence of Iberian markers.

The reported paternal haplogroup of I-M253 is likely responsible for the 0.5% Scandinavian result.

HG_I1_europa
I-M253 haplogroup

The map above shows the likely geography where the I-M253 haplogroup can be found. The paternal haplogroup traces your father’s line through all his male ancestors, all the way back until the mutation originated. My patrilineal ancestry traces back to the area in the map above and my working theory, considering all the British components, is Viking ancestry. Since there is little Scandinavian in these results and the paternal haplogroup suggests a male component in the incorporation of these genetic markers, I believe a Viking raid might be responsible for the introduction of this DNA to the British Isles. If the introduction of this DNA was due to geographic and cultural drift, I believe there would be more than 0.5% because this type of DNA exchange wouldn’t be fleeting and would include more admixture over time, resulting in a larger number.

Maternal haplogroups are the other side of the coin, tracing an individual’s matrilineal lineage back through all the female ancestors to a specific mutation. In my case the mutation is the U3a1 mutation. Everyone alive on Earth today can trace their maternal DNA beyond the haplogroup mutations back to one woman, Mitochondrial Eve. The U3a1 haplogroup is still in its infancy research-wise. As more people with the haplogroup are tested, the results will become more robust. It is likely the marker that suggested the Caucasus result in my Ancestry.com maternal results.

maternalhaplogroup

I really like the scientific depth 23andMe goes into when presenting the haplogroup results. This is the final leg of the origin story. You can trace this haplogroup tree all the way back to Mitochondrial Eve, completing the ultimate beginnings of the origin story. 23andMe incorporates scientific citations into the result pages, allowing inquisitive users easy access to the original research.

The staple of the 23andMe service has to be the health reports. They are constantly rolling out new reports, as evidenced by the email alerts I receive every few months. There is also a report that provides insight on traits that you’re likely to have. These 2 reports contain things like your likelihood to be able to smell asparagus, eye color, chance to have a widow’s peak, and your finger length ratio, to name a few.

Traits

The health reports include your carrier status for, at this time, 42 known gene and disease associations. Genetics is still in its relative infancy. The human genome was completely sequenced only in 2003. That has left a short 14 years for companies like 23andMe to research the association between genes and disease. Unfortunately, not every disease has a known genetic marker. It would be nice to know (maybe) if you’re going to develop cancer based on a DNA test but it’s just not possible yet. The following are a few examples of what the report provides.

Carrier_Status

There is another section titled “Genetic Health Risks” which lets the tester know if they’re at risk for certain ailments that are associated with, but not directly caused or carried by, genetic markers. This section is philosophically interesting because it asks you if you’re sure that you want to see these results. Some day we might have the ability to forecast exactly what our cause of death may be, what degenerative diseases we’ll encounter when we age, or what type of cancers we’ll have to battle with. The choice to be ignorant about these things is a right some individuals might want to retain and this expression of “are you sure you want to see these results” is a peek into how this information might be presented in the future. On one hand, someone might not want to go through life awaiting a debilitation that will certainly come, choosing to be ignorant of the fact instead. However, others might choose to know what is in store for them so they can properly prepare. I’m the type of person that wants to be equipped with all the information possible. There are currently 4 of these health risks reported by 23andMe.

Genetic Health Risk

My results indicate that because of an ε4 genetic variant in the APOE gene, a marker associated with an increased risk for Alzheimer’s disease, I’m at a slightly increased risk for Alzheimer’s. This is only one part of the puzzle and the work is still in the preliminary phases so nothing is certain. We’re definitely not yet in the age where bioinformatics can accurately predict what will happen with 100% certainty. The research goes on to state that European men with this variant have a likelihood of 20-23% to develop Alzheimer’s by 85. I’m a betting man, though, and I’ll be taking my chances.

This is a good time to bring up the privacy considerations associated with these tests. When you agree to have your DNA tested by what are essentially genetic data brokers, you assume some risk. These organizations own the results of your DNA tests, and this allows them to build the databases they use to accurately test other members of the site, research additional genetic markers, and connect you with your relatives. They own this data regardless of what inventions and data brokerage opportunities may come about in the future. It’s impossible to know for sure what amazing and/or privacy-violating applications this data will be eligible for in the future. It is likely that the results of this DNA test will outlive the person being tested. The medical information that these tests produce is in an interesting legal limbo. Could insurance companies use this information when providing or quoting you coverage? If I’m at risk for something according to my DNA, will my premiums be adjusted to account for this? What price would it take for 23andMe to disclose this information? What laws are in the works to protect citizens from this predatory data brokerage? Are you sure you want to hand over that DNA?

Finally, 23andMe provides “Wellness Reports” for things that take into consideration a number of genetic factors. These include things like lactose intolerance and sleep movement.

Wellness Reports

Overall, 23andMe offers a robust ancestry breakdown and is the leading edge for consumer DNA health reports. At $199 it’s the priciest of the bunch and, personally, I think it’s worth the cost for the information provided. Being able to compare the Ancestry.com information was worth it for me. The health reports were an added benefit. This is a product that keeps working for you, as well. It is constantly being updated as more research is being completed.

The Genographic Project

The last test was provided by National Geographic in the form of The Genographic Project. Its cost is middle-of-the-road at $149.99 and it provides a few unique features alongside the traditional ancestry composition breakdown. As of May 2017 there are 834,322 participants in the database. I took this test in 2017 and was familiar with the results of the other two tests. I wasn’t exactly sure what to expect.

Screenshot from 2017-05-16 17-08-13
My Genographic Project results

The regional results for this test were very broad and I assume the intention was to be accurate rather than specific. I can understand the Northwestern Europe portion of the results; everything seems to fall in a range of reasonable expectation. The Southwestern Europe portion is what threw me for a loop. This hearkens back to the Iberian indicators that were present in the Ancestry.com results and mysteriously absent from the 23andMe results. Some marker is likely being interpreted differently by the tests. The Italy and Southern Europe results are also larger than expected. The Northeastern Europe portion likely refers to the Scandinavian element, probably due to the paternal haplogroup. The amount is generous compared to the 23andMe results and more conservative compared to the Ancestry.com results. This value is likely closer to the <5% side of the spectrum considering 2 of the 3 tests have placed it there. It’s interesting to see Eastern Europe mentioned, likely of paternal origin considering the Finnish and Russian influence. My impression is that this database is younger and smaller than the other two.

The Genographic Project analysis included what I’d consider to be my favorite part of any of the tests. This portion of the results looks at how long ago you shared an ancestor with a notable figure in history.

Genius

My closest famous cousin is Leo Tolstoy of Russia. Since it is on the paternal side, this is likely where the Eastern European element of the results comes from. Not much scientific detail is provided, though. If an individual wanted to connect the dots it would have to be done manually, and there are quite a few generations of family tree to fill out between now and 12,000 years ago. This connection is quite broad. Our most recent common ancestor could have existed before people were writing cuneiform. Hopefully talent is genetic because I could use some literary inspiration. Another notable entry is Genghis Khan.

Haplogroups are included in the analysis. The Genographic analysis seems to include additional information compared to 23andMe.

haplogroup

In addition to the haplogroup itself, the results include the number of people that share the haplogroup. Compared to the result on 23andMe, the Genographic maternal result includes an additional demarcation, U3a1b compared to just U3a1, suggesting the haplogroup testing is more in depth. The paternal line remains the same.

The final element of the Genographic analysis is the Neanderthal markers.

Neanderthal

This number corroborates the 23andMe numbers in that both are above the average number of markers present in individuals in the dataset. Not much else can be inferred from this at this time.

I felt the Genographic test was lacking compared to the other two services. It seems to be in the same vein as 23andMe but doesn’t provide the health data. The price could definitely be lower. I feel like, for what it offers, the price would be more appropriate at $79 rather than $150. Hopefully the service improves in the future and provides additional insight. As it stands now, you are paying to be in a database. That isn’t necessarily a bad thing because National Geographic is likely to put this data to good use.

It was a close call between 23andMe and Ancestry.com. They both have carved out a sizable niche. 23andMe has the health elements cornered and Ancestry has the robust backend for providing elements relative to family tree construction. The Genographic project didn’t provide the depth of data I have been seeing from these more mature products.

There is definitely some scientific methodology to this stuff. I saw some common patterns between all three tests. I’m not sure where my research should go at this point regarding this project. It’s likely that I’ll expand on the family tree and see what kind of interesting conclusions I can draw from it. There are still stories to be discovered in the recent past. As for the extended past, it’s all up to the researchers at this point.

Future endeavors could involve getting additional 23andMe testing done for my parents. I still have results that I could compare between the services. There’s also an interesting service called Teloyears which is focused on the health of telomeres, an element of DNA. It would probably be comparable to 23andMe and at $89 it wouldn’t be a huge loss. The science is still young and I guess what I’m trying to do is get into as many of these databases as possible so that my data and DNA are constantly being engaged with new research.

Since the DNA data has more longevity than the person it came from, who knows what could happen. I’d like to think something I provide might be able to help humanity long into the future. In the meantime, I’ll settle for mailing spit tubes across the United States for a quick laugh.

Mapping the Electoral College, Reality vs. Hypothetical

How much does your vote actually matter? This year’s presidential election was an interesting affair to say the least. The votes haven’t been completely counted as of this writing, but the winner of the popular vote and the winner of the electoral vote are likely not to be the same candidate.

The popular vote winner / electoral vote winner discrepancy isn’t unprecedented. In 2000 George W. Bush won the presidency despite Al Gore winning the popular vote. We’d have to go back to the 1800s to find the other two instances: Benjamin Harrison’s electoral victory over Grover Cleveland’s popular victory in 1888, and Rutherford B. Hayes beating Samuel Tilden, who was the winner of the popular vote, in 1876. The latter dispute was settled by the Compromise of 1877, which promised the removal of federal troops from the South, in an attempt to satisfy popular sentiment, in exchange for a Hayes presidency.

Hillary Clinton will likely become the 4th presidential candidate in American history to win the popular vote but lose the electoral vote. This has brought the role of the electoral college into question in many circles. What role does the electoral college play?

In a representative democracy like the United States, people elect officials to represent their interests. The 538 electors that make up the electoral college correspond to the 435 representatives of the House, the 100 senators representing the states, and 3 electors representing the people living in Washington D.C. Typically, the electors will vote in accordance with the popular vote of their state, but it’s interesting to note that not every state legally binds them to do so. The college was originally implemented to ensure that states with small populations would have a fair say in elections. Article II, Section 1, Clause 2 of the Constitution is the origin of the electoral college’s use in elections.

Let’s look at how the electoral college represents the population.

electoral-college-2016-final

Higher population, more electoral votes. Electoral votes are reapportioned among the states after each census to reflect changing populations. Let’s take a look at population and see how it compares.

population-2015

At a glance everything looks fine. Colors are similar and correspond between the two maps. Let’s compare electoral votes and population mathematically. By dividing the population by the electoral votes we can see how much of a state’s population is represented by 1 electoral vote.
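The division described above can be sketched in a few lines of Python. The electoral vote counts are the 2016 allocations, but the population figures are rounded approximations I've plugged in for illustration, not the dataset behind the maps, so the resulting ratio will only land in the neighborhood of the mapped values.

```python
# Rounded, approximate populations used only for illustration (assumption);
# electoral vote counts are the 2016 allocations.
states = {
    "Wyoming": {"population": 586_000, "electoral_votes": 3},
    "California": {"population": 39_100_000, "electoral_votes": 55},
}


def people_per_electoral_vote(population, electoral_votes):
    """Return how many residents a single electoral vote represents."""
    return population / electoral_votes


weights = {
    name: people_per_electoral_vote(s["population"], s["electoral_votes"])
    for name, s in states.items()
}

# How many times more electoral weight one resident carries in Wyoming
# than in California.
ratio = weights["California"] / weights["Wyoming"]

for name, value in weights.items():
    print(f"{name}: {value:,.0f} people per electoral vote")
print(f"Relative weight: {ratio:.2f}")
```

A smaller number of people per electoral vote means each individual vote carries more weight.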

electoral-weight

Lighter colored states have lower populations per electoral vote meaning someone’s personal vote is worth more in a light-colored state than a dark-colored state. For example, voting in Wyoming, the lowest population per electoral vote, will give your vote 3.62 times more electoral weight than a vote in California, the highest population per electoral vote. This seems strange when first considering the differential. Let’s take a look at voter turnout in the 2016 election.

voter-turnout-2016

90 million voters out of the estimated 231 million that are eligible to vote didn’t vote in the 2016 general election. According to statisticbrain.com 44.4% of people didn’t exercise their right to vote, one of the most critical rights in a democratic society. In the above map we can see some interesting correlations. California’s turnout is the lowest after Hawaii. Is it fair that California would receive population-based electoral votes considering the amount of voter apathy? Should Florida receive the same number of electoral votes as New York despite having a notably higher voter turnout? Should voter turnout even matter at all when considering the allocation of electoral votes? Does it play a role in reality?

Let’s adjust the electoral vote per population by the percent of voter turnout.

population turnout adjusted.png

Nothing significantly different. The Northwest voting bloc is relatively stronger. The Rust Belt as a region sees an increase in voting influence per person on the electoral college. The California-Wyoming comparison made earlier has seen its ratio drop from 3.62 to 2.71, meaning a vote in Wyoming still carries 2.71 times the electoral influence of a vote in California.
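One way to express the turnout adjustment is to multiply each state's population by its turnout before dividing by electoral votes. This is a minimal sketch under that assumption; the two states and all of their numbers are hypothetical placeholders, not the figures behind the map.

```python
def voters_per_electoral_vote(population, turnout, electoral_votes):
    """People who actually voted per electoral vote.

    `turnout` is a fraction between 0 and 1 (e.g. 0.5 for 50%).
    """
    if not 0 <= turnout <= 1:
        raise ValueError("turnout must be a fraction between 0 and 1")
    return population * turnout / electoral_votes


# Hypothetical states with made-up numbers, purely for illustration.
small_state = voters_per_electoral_vote(600_000, 0.75, 3)
large_state = voters_per_electoral_vote(40_000_000, 0.50, 50)

# A ratio above 1 means a ballot cast in the small state still carries
# more electoral weight than one cast in the large state.
adjusted_ratio = large_state / small_state
print(f"{adjusted_ratio:.2f}")
```

Lower turnout in the large state shrinks the gap, which is the same direction of change seen in the adjusted map.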

Let’s see what the electoral college would look like if it were adjusted to reflect these numbers. Here we take the total voting population of each state and redistribute the 538 electors among them, excluding D.C.

electoral-college-redrawn

Of course in reality the number of electors has to be a whole number. You can’t have .5 of an elector. A few things to consider: Florida’s electoral power has significantly increased. California’s has decreased. The Great Plains states have had their electoral influence lowered. The Rust Belt has seen an increase across the board. In this scenario, changing the number of electoral votes based on voter turnout might encourage or discourage people to show up as their state’s electoral influence waxes or wanes.
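The redistribution can be sketched as a simple proportional split, which produces fractional electors like the ones on the map. To get whole numbers, the second function below applies the largest-remainder rule, a common rounding approach that I'm substituting in as an example; it isn't the method used for the map. The state names and voter counts are hypothetical.

```python
def proportional_share(voters, total_electors):
    """Split electors among states in proportion to votes cast.

    Returns fractional electors per state.
    """
    total_voters = sum(voters.values())
    return {
        state: total_electors * count / total_voters
        for state, count in voters.items()
    }


def largest_remainder(shares):
    """Round fractional shares to whole electors while keeping the total."""
    whole = {state: int(share) for state, share in shares.items()}
    leftover = round(sum(shares.values())) - sum(whole.values())
    # Hand the remaining electors to the states with the largest fractions.
    by_fraction = sorted(
        shares, key=lambda state: shares[state] - whole[state], reverse=True
    )
    for state in by_fraction[:leftover]:
        whole[state] += 1
    return whole


# Hypothetical voter counts for three made-up states.
voters = {"A": 530, "B": 310, "C": 160}
shares = proportional_share(voters, total_electors=10)
electors = largest_remainder(shares)
print(shares)
print(electors)
```

Note that this sketch does not enforce the 3-elector minimum each state has in reality; that floor would need to be applied before splitting the remaining electors.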

Perhaps this constant evolution of the electoral college would be a viable solution. As a representative democracy, citizens should expect accurate representation every time they fill out a ballot. Maybe this kind of feedback on voter turnout is excessive and people may feel punished by the apathy of voters they may share a state with. It’s also important to consider that, in reality, each state has 2 senators and at least 1 representative of the house, making the lowest possible electoral votes for a state 3.

If the election were held with this electoral college, including D.C.’s 3 votes, Trump would have beaten Clinton 312.9 to 225.1. In reality, if Michigan goes in favor of Trump, the count will be 306 to 232, which is surprisingly similar. It’s not hard to imagine some federal statisticians crunching numbers and tweaking the electoral college in a back room somewhere in Washington. It’s interesting to consider that if we remove the Rust Belt states of Pennsylvania, Ohio, Michigan, and Wisconsin, the count would be 222.6 to 222.2, showing just how much of a role the Rust Belt plays in this scenario as well as in deciding the election in reality.

Nobody can predict the future course democracy will take in our country. Perhaps the representation of the electoral college will change, be replaced, or continue on in its current state. As of now, it plays a key role in directing the will of the nation through the representatives that champion the American democratic process.

They say the devil is in the details. And the details of this democracy are definitely geographic.

Mapping American Marijuana Laws 2016

After one of the most outrageous election cycles in American history, the geography of American marijuana laws quietly changed. 3 states joined Washington, Washington D.C., Alaska, Oregon, and Colorado with successful ballot measures legalizing the recreational use of marijuana. Florida also joins the ranks of states where psychoactive variations of THC can be used for medical purposes.

marijuana-legality
Marijuana legality as of November 10th 2016 in the United States

The above map shows the updated status of marijuana legality across the nation, reflecting the successful ballot initiatives of the 2016 election cycle.

The geography is interesting to consider. The west coast is the geographic bastion of legal recreational marijuana use. This isn’t hard to believe considering the historically liberal attitudes these states have held when legislating the plant. In the mid 1990s the effort to legalize medical marijuana originated in California and spread in a similar fashion.

Massachusetts, nestled in the similarly minded New England region of the northeastern United States, becomes the first east coast state to legalize marijuana recreationally. Considering the trend of medical marijuana in the region, it is likely to follow the pattern of the west coast, with legality slowly permeating through the ballots of all the northeastern states.

The southeast, more conservative with this type of legislation, allows nonpsychoactive treatment of a limited number of medical conditions. This includes THC in dropper or pill form but not in its plant form. Florida marks the first step in introducing legislation to the region in the form of psychoactive medical treatment. This is the plant form of marijuana containing THC that many people may be familiar with.

Washington D.C.’s previous decision to legalize recreational use is interesting considering its geography. It’s nestled at the crossroads of the southeast and New England, two regions which have different legislative opinions regarding the plant. D.C. is unique in that recreational use is legal but you can’t legally buy it anywhere in the district. Marijuana dispensaries are not allowed to operate within its borders.

In the middle of the country, in the breadbasket of the Great Plains, marijuana remains illegal in all forms, medical or recreational. As of today, 7 states remain where marijuana is completely illegal. If marijuana continues to be a state-level decision, these states will likely be the last to legalize medically and recreationally, if they ever decide to.

Mapping Dropout Rates in Charlotte, NC

“Education is the most powerful weapon which you can use to change the world.”

-Nelson Mandela

This project looked into the dropout rates in the city of Charlotte over 4 different years: 2004, 2008, 2010, and 2012.

I used ArcMap to map the dropout rates that were reported in the Quality of Life reports the city of Charlotte publishes yearly.

Quality of Life reports

These PDF reports were deprecated after the release of the new GIS applet for reporting this data.

Quality of Life Explorer

Methodology

I created a spreadsheet to curate the data of several of the reports.

I then compared the data to create the values of change. Where no values were present, I left the value as null. If there was no data in 2008 but data in both 2004 and 2010, I compared 2004 directly to 2010.
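The comparison rule above, including the skip over a missing year, can be sketched like this. The function name and the sample rates are hypothetical; the actual work was done by hand in a spreadsheet.

```python
def changes_between_years(rates):
    """Compute change between consecutive years that have data.

    `rates` maps a year to a dropout rate, or to None when the report
    had no value. Years with None are skipped, so a gap in 2008 means
    2004 is compared directly to 2010.
    """
    available = [year for year in sorted(rates) if rates[year] is not None]
    return [
        (earlier, later, rates[later] - rates[earlier])
        for earlier, later in zip(available, available[1:])
    ]


# Hypothetical neighborhood with no 2008 value.
example = {2004: 4.0, 2008: None, 2010: 2.5}
print(changes_between_years(example))
```

A neighborhood with fewer than two years of data returns an empty list, which corresponds to leaving the change value as null.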

For 2012 I used the included spreadsheet data. The shapefiles were different because the Quality of Life organization changed how they collected data, making direct comparison with the software difficult.

neighborhoods

I used the following scheme from colorbrewer for the data maps and used an included ArcMap scheme for the change maps.

color brewer.PNG

I included a map with all 4 years visible for easy comparison.

charlotte dropout map.png

To classify the data I joined the excel spreadsheet with the neighborhood shapefile, using the NSA neighborhood identifier.

classify-the-data
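Outside of ArcMap, the same kind of join can be sketched with plain Python by matching rows on the neighborhood identifier. The column names (`NSA`, `dropout_2004`), the neighborhood names, and the values here are all hypothetical stand-ins for the real spreadsheet and shapefile attribute table.

```python
import csv
import io

# Stand-in for the spreadsheet exported as CSV (hypothetical values).
spreadsheet = io.StringIO("NSA,dropout_2004\n1,3.2\n2,5.1\n")
rates_by_nsa = {
    row["NSA"]: float(row["dropout_2004"])
    for row in csv.DictReader(spreadsheet)
}

# Stand-in for the shapefile's attribute table.
neighborhoods = [
    {"NSA": "1", "name": "Neighborhood A"},
    {"NSA": "2", "name": "Neighborhood B"},
    {"NSA": "3", "name": "Neighborhood C"},
]

# Keep every neighborhood; those without a match get None (null).
joined = [
    {**hood, "dropout_2004": rates_by_nsa.get(hood["NSA"])}
    for hood in neighborhoods
]

for row in joined:
    print(row)
```

Keeping unmatched neighborhoods with a null value mirrors how a table join leaves empty fields for features with no matching record.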

I used the following 6-class manual classification across the 4 maps.

classify
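A manual classification boils down to a list of upper break values, with each rate assigned to the first class whose break it doesn't exceed. Here is a minimal sketch using the standard library's `bisect` module; the break values are hypothetical and are not the ones used in the maps.

```python
from bisect import bisect_left

# Hypothetical upper bounds for the first five classes; anything above
# the last break falls into the sixth class.
BREAKS = [2, 4, 6, 8, 10]


def classify(rate, breaks=BREAKS):
    """Return the zero-based class index for a dropout rate."""
    return bisect_left(breaks, rate)


for rate in (0.5, 2, 2.1, 9.9, 15):
    print(rate, "-> class", classify(rate))
```

A 9-class scheme like the one used for the change maps works the same way with eight break values.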

I used the following 9-class manual classification for the change maps.

classify

This project gave me exposure to the manual input of data, which is mechanical and boring but which I find intrinsically rewarding for some reason. I had to manually enter the data from the Quality of Life PDFs into a spreadsheet, which was time intensive, taking about an hour per report. In the future, if I’m ever parsing data in this format, I’ll use an autoscrolling feature to automatically scroll the reports while entering the data at the same time. This, in theory, would cut the data entry time in half. This exposure to data entry opens the door to other presentations of data through ArcMap and other data manipulation applications in the future.