Working with GIS and Python

Python is a powerful scripting language that allows users to script repetitive tasks and automate system behaviors. Python is interpreted rather than compiled, which differentiates its development and operation from languages like C++ and Java. Its scripting syntax might ease the learning curve for those new to programming concepts, and GIS is a great introduction to Python programming. Please excuse any formatting errors. Some indentation was lost when copying over.

 

GIS_analysis_esri.jpg

 

ArcGIS has robust support for Python, allowing many GIS methods to be automated and streamlined within the ArcMap software framework. ArcPy is a Python module that lets Python interface directly with ArcGIS software, giving the user powerful scripting capabilities within the ESRI software ecosystem. ArcMap has a built-in script editor called Model Builder, which provides a graphical interface users can use to construct scripts without the default Python shell. It makes the relationship between Python and the ArcGIS toolkit easier to understand and implement for the visual thinker.

I’ve provided examples of my own work written in either Model Builder or the Python IDE. I tried to keep the examples strictly geographic for this post. These scripts aren’t guaranteed to work flawlessly or gracefully; this is a continued learning experience for me and any constructive criticism is welcome.

Here’s an example of what I’ve found possible using python and ArcPy.

 


# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# 50m.py
# Created on: 2017-03-28 15:18:18.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
Idaho_Moscow_students = "Idaho_Moscow_students"
Idaho_Moscow_busstops_shp = "H:\\Temp\\Idaho_Moscow_busstops.shp"
busstops_50m_shp = "H:\\Temp\\busstops_50m.shp"
within_50m_shp = "H:\\Temp\\within_50m.shp"

# Process: Buffer
arcpy.Buffer_analysis(Idaho_Moscow_busstops_shp, busstops_50m_shp, "50 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

# Process: Intersect
arcpy.Intersect_analysis("Idaho_Moscow_students #;H:\\Temp\\busstops_50m.shp #", within_50m_shp, "ALL", "", "INPUT")

 

The output is tidy and properly commented by default, saving the user the time it usually takes to make code tidy and functionally legible. It also includes a proper header and the location of map assets. All of this is done on the fly, making sure quality code is produced every time. This is a great reason to use Model Builder over manually programming in the IDE.

The script above takes a dataset containing spatial information about students and bus stops in Moscow, Idaho, applies a 50 meter buffer to the bus stops, and creates a shapefile of all the students that fall within this buffer. This information can then be reused by anyone involved in either of these operations, meaning further operations can be applied to the newly created 50m layer on the fly. We can then increment the buffer distance in Model Builder to create shapefiles for different buffers.

The benefit of this over manually creating the shapefile is the obscene amount of time saved. Depending on how thorough the GIS is, each one of these points might need its own shapefile or aggregation of shapefiles. This script runs the 100 or so necessary operations to create the spatial assets in a fraction of the time it would take a human.
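As a sketch of that incrementing idea, the same two tools could also be driven from a loop in a standalone script rather than exporting one Model Builder script per distance. This is my own illustration, not generated output: the distances in the tuple are arbitrary examples, and the paths simply reuse the ones from the generated script above.

import arcpy

students = "Idaho_Moscow_students"
busstops = "H:\\Temp\\Idaho_Moscow_busstops.shp"

## one buffer/intersect pair per distance, each written out as its own shapefile
for distance in (50, 100, 200, 400):
	buffer_shp = "H:\\Temp\\busstops_%sm.shp" % distance
	within_shp = "H:\\Temp\\within_%sm.shp" % distance
	arcpy.Buffer_analysis(busstops, buffer_shp, "%s Meters" % distance, "FULL", "ROUND", "ALL", "", "PLANAR")
	arcpy.Intersect_analysis([students, buffer_shp], within_shp, "ALL", "", "INPUT")

The generated scripts that follow show the same change made one distance at a time.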

The script below takes the same concept but changes the variables so the output is 100m instead of 50m. Segments of the code can be changed to augment the operation without starting from scratch. This makes it possible to automate the creation of these scripts, the ultimate goal.

 

# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# 100m.py
# Created on: 2017-03-28 15:19:04.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
Idaho_Moscow_students = "Idaho_Moscow_students"
Idaho_Moscow_busstops_shp = "H:\\Temp\\Idaho_Moscow_busstops.shp"
busstops_100m_shp = "H:\\Temp\\busstops_100m.shp"
within_100m_shp = "H:\\Temp\\within_100m.shp"

# Process: Buffer
arcpy.Buffer_analysis(Idaho_Moscow_busstops_shp, busstops_100m_shp, "100 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

# Process: Intersect
arcpy.Intersect_analysis("Idaho_Moscow_students #;H:\\Temp\\busstops_100m.shp #", within_100m_shp, "ALL", "", "INPUT")

 

This change from a 50m buffer to a 100m buffer can be made in ArcMap’s Model Builder itself or manually with the replace function in your favorite text editor. By changing one variable we have another properly formatted script, saving time that would have been spent manually operating the tools in the ArcMap workspace. This can be developed further to take input from the user and run the tools directly through ArcPy, allowing for the possibility of “headless” GIS operations without the need to design manually.
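As a sketch of that “headless” idea, the distance could come straight from a prompt and be dropped into the same two tool calls. Everything here other than the original paths is my own assumption:

import arcpy

## ask the user for a distance instead of hard-coding 50 or 100
distance = raw_input("Buffer distance in meters: ")

arcpy.Buffer_analysis("H:\\Temp\\Idaho_Moscow_busstops.shp",
                      "H:\\Temp\\busstops_%sm.shp" % distance,
                      "%s Meters" % distance, "FULL", "ROUND", "ALL", "", "PLANAR")
arcpy.Intersect_analysis("Idaho_Moscow_students #;H:\\Temp\\busstops_%sm.shp %s #" % (distance, ""),
                         "H:\\Temp\\within_%sm.shp" % distance, "ALL", "", "INPUT")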

This functionality extends to database operations. In the following script, shapefiles are created from attributes in a table.

 


# ---------------------------------------------------------------------------
# airport_buffer.py
# Created on: 2017-03-30 09:06:41.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

 

# Local variables:
airports = "airports"
airports__4_ = airports
Airport_buffer_shp = "H:\\Ex7\\Exercise07\\Challenge\\Airport_buffer.shp"

# Process: Select Layer By Attribute
arcpy.SelectLayerByAttribute_management(airports, "NEW_SELECTION", "\"FEATURE\" = 'Airport'")

# Process: Buffer
arcpy.Buffer_analysis(airports__4_, Airport_buffer_shp, "15000 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

 

This script finds all features labeled “Airport” in a dataset and creates a 15km buffer around each one. By integrating SQL queries, data can easily be parsed and presented. All of this code can be generated using Model Builder in the ArcMap client. Efficient scripting comes from applying Python’s logic efficiently, with a clear realization of the objective to be achieved.
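The where clause in that call is just a SQL expression, so it can be built from variables instead of being typed by hand. Below is a minimal sketch of that idea; the paths are hypothetical, the FEATURE field follows the example above, and arcpy.AddFieldDelimiters simply wraps the field name in whatever delimiters the data source expects.

import arcpy

airports = "H:\\Ex7\\Exercise07\\airports.shp"
output = "H:\\Ex7\\Exercise07\\Challenge\\Airport_only.shp"

## build the SQL expression from variables instead of hard-coding the quoting
field = arcpy.AddFieldDelimiters(airports, "FEATURE")
where = "%s = '%s'" % (field, "Airport")

## write the matching features straight to a new shapefile
arcpy.Select_analysis(airports, output, where)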

 


import arcpy
arcpy.env.workspace = "H:/Exercise12"

def countstringfields():
	fields = arcpy.ListFields("H:/Exercise12/streets.shp", "", "String")
	namelist = []
	for field in fields:
		namelist.append(field.type)
	print(len(namelist))

countstringfields()

 

This script counts the number of “String” fields in a table. The function “countstringfields” starts by listing the fields of type “String” in the attribute table of a shapefile. Next, a list of names is defined. A loop then appends the type of each field to the list, essentially counting the string fields “by hand”. The resultant count is then printed for the user, all without opening the ArcMap client. This script can be further developed by introducing variables for the shapefile and data type, read from user input. The proper use of indentation and whitespace is an important part of Python syntax, so when things like nested loops are used, special consideration should be taken. Scripts can also be used to update datasets in addition to parsing them.
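Before moving on to updating data, here is a minimal sketch of that suggested extension, with the shapefile path and field type read from user input instead of being hard-coded. The prompts and the wrapper function are my own additions:

import arcpy

def count_fields(shapefile, field_type):
	## list only the fields of the requested type and report how many there are
	return len(arcpy.ListFields(shapefile, "", field_type))

shapefile = raw_input("Path to shapefile: ")   ## e.g. H:/Exercise12/streets.shp
ftype = raw_input("Field type to count: ")     ## e.g. String, Integer, Date
print("%s has %s %s fields" % (shapefile, count_fields(shapefile, ftype), ftype))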

 


import arcpy
from arcpy import env
env.workspace = "H:/Ex7/Exercise07/"
fc = "Results/airports.shp"
cursor = arcpy.da.UpdateCursor(fc, ["TOT_ENP"])
for row in cursor:
	## placeholder threshold and update; the original comparison on TOT_ENP
	## was lost when the script was copied over
	if row[0] < 100000:
		row[0] = 0
		cursor.updateRow(row)
del cursor

##triangle checks - the list definitions and helper functions below were also
##lost in the copy, so representative values and the standard checks are shown
listA = (3, 4, 5)
listB = (6, 6, 6)
listC = (5, 5, 8)
listD = (1, 2, 9)
count = 0

def triangle_type(a, b, c):
	if a == b and b == c:
		print "Type: Equilateral"
	elif a == b or b == c or a == c:
		print "Type: Isosceles"
	else:
		print "Type: Scalene"

def triangle_validity(a, b, c):
	if (c > a + b) or (a > b + c) or (b > a + c):
		print "Valid: No"
	else:
		print "Valid: Yes"

##inputs

print("Feeding program lists of measurements...")

while count < 4:
	if count == 0:
		print listA
		a, b, c = listA
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	elif count == 1:
		print listB
		a, b, c = listB
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	elif count == 2:
		print listC
		a, b, c = listC
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	else:
		print listD
		a, b, c = listD
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)

 

The script above was fun to make.

It accepts input in the form of multiple lists. These lists are preprogrammed in this case, but they could take user input or be filled from a text file using Python’s read methods. The while loop uses a counter to track how many times it has been run, and the loop is nested with some conditional elements.
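As a sketch of the text-file option, the same checks could be fed from a hypothetical measurements.txt with one comma-separated triple per line. The file name and format are my own assumptions, and triangle_type and triangle_validity are the functions defined above:

## hypothetical measurements.txt, one triangle per line, e.g. "3,4,5"
measurements = []
f = open("C:\\Temp\\measurements.txt")
for line in f:
	a, b, c = [float(x) for x in line.strip().split(",")]
	measurements.append((a, b, c))
f.close()

for a, b, c in measurements:
	print (a, b, c)
	triangle_type(a, b, c)
	triangle_validity(a, b, c)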

 

import csv

yield1999 = []
yield2000 = []
yield2001 = []

f = open('C:\\Users\\jdean32\\Downloads\\yield_over_the_years.csv')
csv_f = csv.reader(f)
next(csv_f)
for row in csv_f:
	yield1999.append(row[0])
	yield2000.append(row[1])
	yield2001.append(row[2])

yield1999 = map(float, yield1999)
yield2000 = map(float, yield2000)
yield2001 = map(float, yield2001)

f.close()

print("1999: %s") %(yield1999)
print("2000: %s") %(yield2000)
print("2001: %s") %(yield2001)

year1999 = 1999
max_value_1999 = max(yield1999)
min_value_1999 = min(yield1999)
avg_value_1999 = sum(yield1999)/len(yield1999)
print("\nIn %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield.") %(year1999, max_value_1999, min_value_1999, avg_value_1999)

year2000 = 2000
max_value_2000 = max(yield2000)
min_value_2000 = min(yield2000)
avg_value_2000 = sum(yield2000)/len(yield2000)
print("In %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield.") %(year2000, max_value_2000, min_value_2000, avg_value_2000)

year2001 = 2001
max_value_2001 = max(yield2001)
min_value_2001 = min(yield2001)
avg_value_2001 = sum(yield2001)/len(yield2001)
print("In %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield." % (year2001, max_value_2001, min_value_2001, avg_value_2001))

 

Like I said before, this was fun to make. Always eager to take the road less traveled, I thought of the most obtuse way to make this calculation. The objective of the above script was to read text from a file and compare three years of agriculture data. The script then finds each year’s maximum yield, minimum yield, and average yield. This is all accomplished with a quick for loop and several sets of variables to make sure the final answers are correct. The program can ingest different kinds of input, so changing the text file, or the location where it is looked for, will produce different results from the same script. Different data can be automatically run through this particular operation.
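As a sketch of that idea, the per-year statistics could be wrapped in a single function that takes any CSV laid out like the one above, assuming the header row holds the year names and each column holds that year’s yields. The function name is my own:

import csv

def yearly_stats(csv_path):
	## read each column of yields into its own list, keyed by the header row
	f = open(csv_path)
	reader = csv.reader(f)
	headers = next(reader)
	columns = [[] for h in headers]
	for row in reader:
		for i, value in enumerate(row):
			columns[i].append(float(value))
	f.close()
	for year, values in zip(headers, columns):
		print("In %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield." % (year, max(values), min(values), sum(values)/len(values)))

yearly_stats('C:\\Users\\jdean32\\Downloads\\yield_over_the_years.csv')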

 

yield1999 = [3.34, 21.8, 1.34, 3.75, 4.81]
yield2000 = [4.07, 4.51, 3.9, 3.63, 3.15]
yield2001 = [4.21, 4.29, 4.64, 4.27, 3.55]

location1 = (yield1999[0] + yield2000[0] + yield2001[0])/3
location2 = (yield1999[1] + yield2000[1] + yield2001[1])/3
location3 = (yield1999[2] + yield2000[2] + yield2001[2])/3
location4 = (yield1999[3] + yield2000[3] + yield2001[3])/3
location5 = (yield1999[4] + yield2000[4] + yield2001[4])/3

locations = [location1, location2, location3, location4, location5]
count = 1

text_fileA = open("C:\Temp\OutputA.txt", "w")

for i in locations:
text_fileA.write(("The average yield at location %s between 1999 and 2001 %.2f\n") %(count, i))
count = count + 1

text_fileA.close()

max1999 = (yield1999.index(max(yield1999))+1)
max2000 = (yield2000.index(max(yield2000))+1)
max2001 = (yield2001.index(max(yield2001))+1)
min1999 = (yield1999.index(min(yield1999))+1)
min2000 = (yield2000.index(min(yield2000))+1)
min2001 = (yield2001.index(min(yield2001))+1)

minmax1999 = [1999, max1999, min1999]
minmax2000 = [2000, max2000, min2000]
minmax2001 = [2001, max2001, min2001]

minmax = [minmax1999, minmax2000, minmax2001]

text_fileB = open("C:\Temp\OutputB.txt", "w")

for i in minmax:
text_fileB.write(("In %s we yielded the least at location %s and the most at location %s.\n") %(i[0], i[2], i[1]))

text_fileB.close()

 

Another attempt at the agriculture problem. Versioning is something I find useful not only for keeping a record of changes but also for keeping track of progress. This was the fourth version of this script and I think it turned out very unorthodox, which is what I find most interesting about coding: you can find multiple approaches to complete an objective. The two scripts above are similar but approached in different ways. This script uses a for loop to run through a contextually sensitive number of inputs. The values were hardcoded into the program as variables at the start of the script, but they could be read from a file if necessary.

The following script looks for the basin layer in an ArcMap document and clips the soils layer using the basin layer. This produces an area where both the soil layer and the basin layer are present. From this clipped soil layer, the script goes on to select the features whose attribute is "Not prime farmland". This is useful for property development where the amount of farmland available is a consideration.

 

 

import arcpy

print "Starting"

soils = "H:\\Final_task1\\soils.shp"
basin = "H:\\Final_task1\\basin.shp"
basin_Clip = "C:\\Users\\jdean32\\Documents\\ArcGIS\\Default.gdb\\basin_Clip"
task1_result_shp = "H:\\task1_result.shp"

arcpy.Clip_analysis(soils, basin, basin_Clip, "")

arcpy.Select_analysis(basin_Clip, task1_result_shp, "FARMLNDCL = 'Not prime farmland'")

print "Completed"

 

The next script clips all feature classes from a folder called "USA" according to the Iowa state boundary. It then places them in a new folder. This is useful if you have country-wide data but only want to present the data from a particular area, in this case Iowa.

The script will automatically read all shapefiles in the USA folder, no matter the amount.

 

 

import arcpy

sourceUSA = "H:\\Final_task2\\USA"
sourceIowa = "H:\\Final_task2\\Iowa"
iowaBoundary = "H:\\Final_task2\\Iowa\\IowaBoundary.shp"

arcpy.env.workspace = sourceUSA
fcList = arcpy.ListFeatureClasses()

print "Starting"

for features in fcList:
	outputfc = sourceIowa + "\\Iowa" + features
	arcpy.Clip_analysis(features, iowaBoundary, outputfc)

print "Completed"

 

The following script finds the average population for a set of counties in a dataset. By dividing the total population by the number of counties, the average population is found. This is useful for calculating values in large datasets without doing it by hand.

 

 

import arcpy

featureClass = "H:\\Final_task3\\Counties.shp"

row1 = arcpy.SearchCursor(featureClass)
row2 = row1.next()

avg = 0
totalPop = 0
totalRecords = 0

while row2:
	totalPop += row2.POP1990
	totalRecords += 1
	row2 = row1.next()

avg = totalPop / totalRecords
print "The average population of the " + str(totalRecords) + " counties is: " + str(avg)

 

The following is a modified script that calculates the driving distance between two locations. Originally the script calculated the distance between UNCC and uptown; it has been edited to work from user input. The API is finicky, so the variables have to be exact to call the right data. User input is reconciled by replacing spaces with plus signs.

 

## Script Title: Printing data from a URL (webpage)
## Author(s): CoDo
## Date: December 2, 2015

# Import the urllib2 and json libraries
import urllib2
import json
import re

originaddress = raw_input("What is the address?\n")
originstate = raw_input("What is the state?\n")
originzip = raw_input("What is the zipcode\n")
destinationaddress = raw_input("What is the destination address?\n")
destinationstate = raw_input("What is the state?\n")
destinationzip = raw_input("What is the destination zipcode\n")

print originaddress
print originstate
print originzip
print destinationaddress
print destinationstate
print destinationzip

originaddress = originaddress.replace(" ", "+")
destinationaddress = destinationaddress.replace(" ", "+")

# Google API key (get it at https://code.google.com/apis/console)

google_APIkey = "" ##removed for security

# Read the response url of our request to get directions from UNCC to the Time Warner Cable Arena
url_address = 'https://maps.googleapis.com/maps/api/directions/json?origin=%s,%s+%s&destination=%s,%s+%s&key='% (originaddress, originstate, originzip, destinationaddress, destinationstate, destinationzip) + google_APIkey
##url_address = 'https://maps.googleapis.com/maps/api/directions/json?origin=1096+Meadowbrook+Ln+SW,NC+28027&destination=9201+University+City+Blvd,NC+28223
url_sourceCode = urllib2.urlopen(url_address).read()

# Convert the url's source code from a string to a json format (i.e. dictionary type)
directions_info = json.loads(url_sourceCode)

# Extract information from the dictionary holding the information about the directions
origin_name = directions_info['routes'][0]['legs'][0]['start_address']
origin_latlng = directions_info['routes'][0]['legs'][0]['start_location'].values()
destination_name = directions_info['routes'][0]['legs'][0]['end_address']
destination_latlng = directions_info['routes'][0]['legs'][0]['end_location'].values()
distance = directions_info['routes'][0]['legs'][0]['distance']['text']
traveltime = directions_info['routes'][0]['legs'][0]['duration']['value'] / 60

# Print a phrase that summarizes the trip
print "Origin: %s %s \nDestination: %s %s \nEstimated travel time: %s minutes" % (origin_name, origin_latlng, destination_name, destination_latlng, traveltime)

 

This next script looks for feature classes in a workspace and prints the name of each feature class and the geometry type. This would be useful for parsing datasets and looking for specific features, like polygons.

 

import arcpy
from arcpy import env
env.workspace = "H:/arcpy_ex6/Exercise06"
fclist = arcpy.ListFeatureClasses()
for fc in fclist:
	fcdescribe = arcpy.Describe(fc)
	print (fcdescribe.basename + " is a " + str.lower(str(fcdescribe.shapeType)) + " feature class")

 

The following script adds a text field to the attribute table for roads. The new field is called Ferry and is populated with either "Yes" or "No" values, depending on the value of the feature field.

This is useful for quickly altering data in an attribute field or dataset without directly interfacing with the ArcMap client.

 

##libraries
import arcpy
from arcpy import env
env.workspace = "C:/Users/jdean32/Downloads/Ex7/Exercise07"

##variables
fclass = "roads.shp"
nfield = "Ferry"
ftype = "TEXT"
fname = arcpy.ValidateFieldName(nfield)
flist = arcpy.ListFields(fclass)

if fname not in [field.name for field in flist]:
	arcpy.AddField_management(fclass, fname, ftype, "", "", 12)
	print "Ferry attribute added."

cursor = arcpy.da.UpdateCursor(fclass, ["FEATURE","FERRY"])

for row in cursor:
	if row[0] == "Ferry Crossing":
		row[1] = "Yes"
	else:
		row[1] = "No"
	cursor.updateRow(row)
del cursor

 

The following script uses the same functionality as the airport script near the beginning of this article. That script created a 15,000 meter buffer around airport features in a shapefile; this one adds a 7,500 meter buffer around airports that operate seaplanes, which requires looking at the attribute table for seaplane bases. The end result is two separate buffers. A picture says 1,000 words; by having two buffers we multiply the amount of data that a single cartographic visualization can project.

 

# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# airport_buffer.py
# Created on: 2017-03-30 09:06:41.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
airports = "airports"
airports__4_ = airports
Airport_buffer_shp = "H:\\Ex7\\Exercise07\\Challenge\\Seaplane_base_buffer.shp"

# Process: Select Layer By Attribute
arcpy.SelectLayerByAttribute_management(airports, "NEW_SELECTION", "\"FEATURE\" = 'Seaplane Base'")

# Process: Buffer
arcpy.Buffer_analysis(airports__4_, Airport_buffer_shp, "7500 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

 

Finally, we have a script that looks through a workspace, reads the feature classes, and copies them to a new geodatabase, with the polygon feature classes also copied to a second geodatabase. Once again, this makes it easy to parse and migrate data between sets.

 

import arcpy, os
arcpy.env.workspace = r'H:\arcpy_ex6\Exercise06'
fclass = arcpy.ListFeatureClasses()

outputA = r'H:\arcpy_ex6\Exercise06\testA.gdb'
outputB = r'H:\arcpy_ex6\Exercise06\testB.gdb'

for fc in fclass:
	fcdesc = arcpy.Describe(fc).shapeType
	outputC = os.path.join(outputA, fc)
	arcpy.CopyFeatures_management(fc, outputC)
	if fcdesc == 'Polygon':
		outputC = os.path.join(outputB, fc)
		arcpy.CopyFeatures_management(fc, outputC)

 

Python is a blessing for geographers who want to automate their work. Its logical but strict syntax allows easily legible code. Its integration with the ArcGIS suite and its fairly simple syntax make it easy to pick up, for experts and beginners alike. Model Builder abstracts the programming process and makes it easier for people who are familiar with GUI interfaces. There is little a geographer with a strong knowledge of Python and mathematics can’t do.

Working with Untangle Firewall

Untangle Firewall is a hardware security solution that provides a robust platform to control and observe network operations. The suite of software includes a firewall, web content blocker, routing capabilities, and many more traffic shaping features. I was interested in trying this out because I was looking for peace of mind regarding home network security. I’m pleased with how my Untangle box has been working so far. In this write-up I briefly explain my experience with different apps included in the software.

The hardware requirements for Untangle version 13 are pretty light for a small home network. To avoid any hassle I tried out a Protectli Vault, fitted with a J1900 processor, 8 GB of RAM, a 120 GB SSD, and a 4 port Intel NIC, for $350 at the time of this writing. It’s a workhorse and perfect for my network of about 8–12 devices. It’s working with a 300/20 connection whose upload is constantly redlined. The CPU has clocked in at 50% under the heaviest load, so there is definitely room to scale with this route. If I wanted to get brave I could switch out the 8 GB memory stick for 16 GB if the board allows it. The SSD swapfile should carry me plenty if things get rough.

Installation can be done using just a USB keyboard. In this case Untangle was loaded from a USB stick into one of the two USB connections on the Vault. Untangle charges different rates for commercial and home users. Out of the gate, Untangle comes with a 14-day free trial. After the grace period it’s $50/year for the home version, which includes all the “apps”. One thing I wish it had, though, was a screenshot feature.

 

collage-2017-08-08.png

 

Out of the box; simple and productive. The homepage can be customized to include a plethora of different visualized reports.

Network management took a second to get used to. At first I wanted to get my bearings by googling every session I saw pop up, then slowly expanding the network to more devices as I felt more comfortable. This led me to some interesting whois websites which provide useful domain data to compare with the built-in Untangle resolution. I noted the IPs I didn’t know, using the session viewer in real time, until I had become familiar with the addresses and ranges that services on the network typically use. This type of experience with network behavior lets an administrator quickly view the status of the network by looking at the geographic or other visual representations of the data. I feel the at-a-glance data visualization is a key advantage of using Untangle and software like it. I chose to investigate the different apps individually so understanding their functions became easier. At first the amount of information available was overwhelming, but the software had a reasonable learning curve so that feeling was short lived.

I apologize for the screen pictures. For this particular instance I wanted to know what the OCSP connection was; Google suggested it checks the validity of the certificates installed on the machine. I like the at-a-glance functionality a home screen with contextually selected apps offers. The map tickles my geographic fancy. Sometimes it’s easier to work with spatial data; glancing at a map and noting the locations of the connections can assist with interpretation on the fly. It would be even better if you could export the state of the dashboard to a static image. Exporting the configuration of the dashboard would be beneficial, too, allowing an administrator to quickly restore the last configuration. I might be missing something, but it doesn’t seem to allow moving visualization tiles once they’ve been placed on the dashboard. This could be a major inconvenience when reorganizing or grouping visualizations after the fact, but the geographer in me appreciates the map-centric dashboard all the same.

At first it’s easy to misestimate the number of connections a computer can make in a browsing session. The web page loads, the 10 or so ads and marketing services connect, the DNS is queried. With three internet devices browsing the internet and interfacing with media, the number of sessions can easily reach the hundreds. I worked with each app individually until I felt like I had a solid understanding of the underlying systems. Approaching the software in this manner made it easier to understand at a functional level.

 

800px-1600x1080_apps.png

 

First up was the firewall. Through trial and error, I figured out which connected sessions were important to my computing. This was the most critical component I needed security-wise. Being able to see all of the open sessions, in real time and retroactively, gave me enough data to play with initially to get a hang of the system and understand the routine sessions on my network. The firewall lets you set rules that block traffic. Let’s say I own a business and I want to block all traffic that appears to be from Facebook; this would be possible by setting custom firewall rules that block the Facebook domain. In my case I wanted to identify what exactly was going on with the background connections: Windows telemetry data, time synchronization efforts, and web sessions being kept alive by a browser. I identified the major, constant connections, like the cloud migration to Amazon Cloud Drive I’m currently running, along with the constant odrive connection brokering that upload. Accounting for connections like these allows the administrator to get comfortable with the network and see how it is normally shaped. Connections I had personally accounted for were set to bypass the firewall entirely so I could reconfigure the rules without worrying about them being taken offline. The peace of mind this device provides when auditing or performing network forensics feels priceless.

Untangle includes two web traffic shaping apps: Web Filter and Web Monitor. A few of the apps have “lite” versions (free) and full versions (paid). The application library has a Virus Block Lite and a Virus Blocker; one is the free version and the other is included in the subscription. According to Untangle’s developers, the lite version and the paid version provide additional protection when run in tandem. They might be using different databases or heuristics to identify threats between the two apps.

Web Monitor is the free app; it allows you to monitor web traffic, its origination, destination, size, associated application, and so on. Web Filter is required to shape the traffic. Out of the box, Web Filter comes with several categories of web traffic it blocks: pornography, malware distributors, known botnets, and anonymizing software are all blocked by default. Several hundred additional categories of web traffic exist to make this selection as precise as an administrator would like. There was one instance where the filter warned me before I was redirected to a malware site while sifting through freeware. This is a necessity for me. The ad blocker, which functions similarly to a Pi-hole, catches ads before they even make it to the client computer. Normally a user would expect the browser to block ads, but that’s not the case with this in-line device. The ability to catch ads over the wire adds an additional line of defense alongside a traditional browser ad blocker.

Intrusion Prevention is another app I couldn’t live without. Intrusion prevention systems (IPS) use behavioral and signature analysis to inspect packets as they move across the network. If the signature of a communication or a behavior registers as malicious, the IPS logs it and, according to the user-set rules, blocks the attempted misbehavior. The intrusion detection was quiet while I was messing with it, which is a good sign. There were several UDP portscans and distributed portscans originating from the Untangle box. These might be functions of the Untangle install or the intrusion detection app scanning the public IP for vulnerabilities, but I’m not 100% sure; it could always be a malicious actor over the wire. Whatever the cause, these portscans were the only behaviors the intrusion prevention system picked up.

The question becomes, how thorough do you want to be when setting up rules for the apps. Let’s say a Chromecast is portscanning itself for benevolent reasons, like troubleshooting a connection. Should you allow this? Should you follow the rule of least privilege? Should Chromecast have the ability to recon your network? Security and convenience tend to be mutually exclusive to a certain degree. Knowing what your sweet spot of productivity is will allow better administration of the box.

 

collage-2017-08-09.png

Bandwidth control is something I’m still getting the hang of. One question I have is why the speeds reported by the bandwidth monitoring app and by the interface readings seem to be off by a factor of 10. They both seem to be presenting results in the MB/s format, and I haven’t found a unit conversion error.

I can’t speak for the bandwidth app itself, but there are additional apps for bandwidth shaping. WAN Balancer makes sure a serving load is balanced across a number of assets; if you were running a server that needs high availability and maximized performance, you would get some use out of the feature. WAN Failover activates a backup connection in case the primary WAN is unreachable. Again, these features are geared towards users with the need for traffic shaping and high-availability solutions.

There is an app for both IPsec VPN and OpenVPN. I didn’t have a chance to mess around with these. There is a webinar on the IPsec VPN hosted by Untangle on YouTube. I’m curious about the particularities because I’m eager to get this feature operational as soon as possible.

I had an interesting time with the SSL Inspector. This app allows you to decrypt HTTPS sessions and intercept traffic before encrypting it again and sending it on its way. Turning this on threw SSL errors on almost all devices in the house. Things like the Roku couldn’t connect to YouTube because the certificate chain was incomplete, since the Untangle box was middle-manning the connection. Luckily, it comes with a certificate creator that can serve certificates to client computers so browsers won’t think it’s a malicious redirect.

Transferring the root certificate around was comically difficult. It couldn’t be transferred through Gmail because of security issues. Those issues might have been because Google thought the attachment was malicious, or because it’s not good OpSec to email root CA installers around, although this one was for a client computer. The SSL app is able to generate an installer for Windows machines in addition to the plain cert.

I was able to move it around by putting it on Google Drive. Downloading it with Edge threw all sorts of bells and whistles. First SmartScreen said it didn’t recognize the file and threw the “are you sure you want to download” prompt, then the browser warned that “this file could harm your computer”, then Kaspersky prompted about the file, and finally UAC was triggered. This is all in good measure; installing bogus certs on computers this way can be compromising.

SSL Inspector needed to be turned off while this configuration was being done. The internet was unusable with browsers like Edge with SmartScreen because of the certificate errors. MAC addresses for devices with hardcoded certs bypassed the SSL Inspector altogether so they wouldn’t throw errors.

 

stuntsec_ca.png

 

Until the correct certs were installed on the network devices, the internet was practically unusable with the SSL Inspector turned on.

Captive Portal and the Brand Manager apps were nice touches to include. These were probably the most fun I had playing around with the software. The Brand Manager allows you to supply your own logos to replace the default Untangle logo throughout the software. I designed a mockup logo for fun and really enjoyed how thorough this functionality was.

The Captive Portal seems to function in a similar way to the SSL Inspector, though I think it uses a different certificate because it throws certificate errors on machines with the SSL Inspector cert installed. The captive portal page can include your Brand Manager content, display and solicit agreement to a terms of service, offer the option to download the certificate and/or the installer, log a user in, and broker a number of other useful functions. Very cool if you’re trying to administer web usage.

 

Stuntman Security 2.png

 

Web Cache is something you want to consider if you’ve got the resources for it. A web cache monitors traffic and puts frequently visited elements in a cache that it can serve locally. If I’m logging on to Facebook every day, it’s easier, and arguably safer, to store the Facebook logo locally and serve the local copy instead of asking the website for it. The Web Cache presents a lucrative target for attackers, but luckily keeping tabs on its operation with the Untangle reporting system is easy.

There are also the features that you would expect to see in home security software. Untangle’s advantage is catching threats over the wire, theoretically before they hit the client box. The complete package includes the two virus scanning apps, the Phish Blocker, which I assume is some kind of DNS functionality to check URLs for malpractice, and the two spam blocker apps, which I believe work with some cloud threat database. These tools provide the same functionality as a security suite for your desktop. If you start seeing unusual malware activity you can leverage the firewall against it to really turn up the heat.

In addition to the virus and malware protection, an ad blocker is included. Like the advantage above, Untangle sees the advertising domains and blocks them before they hit the boxes behind it. I know for certain the ad blocker has been busy on my box.

Active Directory support is available to further expand your capability on the local network. I didn’t have a chance to mess around with it; most home networks don’t have Active Directory services running, but some power users out there should get a kick out of it. I played around with Policy Manager for a bit. It’s useful if you want to run SSL inspection on one group of devices and ignore others, like streaming devices. Essentially each policy runs its own set of apps and generates its own reports. Very useful for compartmentalizing your network.

A lot of the Untangle apps demand more resources as you connect more devices to the network, so you need to be conscious of the box running Untangle and how scalable it is. If you’re running a Web Cache for 100 users, the resources required to manage it scale dramatically compared to 10 users, depending on their workflow. SSL Inspector can be a problem if resources are limited while the workload increases, and intrusion detection is another relative resource hog.

I learned about DHCP and routing the hard way, which is always the most effective way. I realized I wasn’t resolving hostnames from devices that were connected to the router. A router, typically by default, sends all information upstream from one IP address. The reasons for this are twofold: first, there aren’t enough IPv4 addresses to issue one to every device, and second, it’s safer to have the router acting as a firewall so each home device doesn’t directly face the internet.

By changing the wireless router behind the Untangle box to “access point” mode, DHCP serving was deferred to the Untangle box. Untangle was then able to resolve the hostname for each device connected to the wifi, which allows for fine tuning of access rules and traffic shaping.

The remote functionality is robust and well supported. Access can be tailored to the user: users that only need access to reports are safely granted that access without enabling access to system settings. Multiple boxes can be administered from a single interface, and phone administration is possible through the browser. HTTP administration must be allowed from the box to permit configuration from a client.

The Reports app, though more of a service, is probably the most important app on the box. Reports act as the liaison between the administrator and the Untangle utilities. Graphs are easily generated and data is visualized so it can be digested on the fly. Reports can be stored on the box for up to 365 days, though you will have to account for the resource usage of maintaining this database. Reports can automatically be sent to your email inbox at an interval of your choosing; this report contains much of the top-level information about the box’s performance, allowing remote administration to be conducted confidently and quickly.

The configuration for each Untangle install can be backed up with the Configuration Backup app. It has built-in Google Drive functionality and can send backups to and restore from the cloud, eliminating the need for panic if a box becomes physically compromised. Another scenario for this functionality would be sending a configuration template to new boxes: after installing a new box, you would just need to select the loadout from Google Drive and hours of possible configuration could be avoided. The same backup functionality is available for reports. So essentially, if a box burns up, you just have to replace the hardware and it’s back off to the races thanks to the automated backups.

I had a great time messing around with this software, and I’m very pleased with the hardware purchase. The all-in-one computer plus a year’s subscription to Untangle at home was $400. I’m enjoying it so much I’m considering a second box that I can administer remotely. The setup definitely provided a peace of mind that application-level solutions couldn’t. Hopefully in the future I can use some of the data for geographic projects; I’ve already started messing around with projecting some of it in ArcMap. Here’s to hoping for more positive experiences working with the Untangle box.

Listr – Automatic List Creation for Bash

Bash scripting is a feature of many Linux distributions. This built in scripting language allows programmers to get behind the scenes with their Linux distributions and automate repetitive or complex tasks.

I’m nostalgic for the old school feel of dialogue-based menus. I personally love a terminal program that uses lists to execute operations. Building lists in Bash can be tedious, though. One of the more meta applications of scripting is making scripts that write other scripts. These kinds of developer operations help cut costs and make work more effective with minimal effort in the future.

03.png

This is the second bash script I’ve ever written. I’m by no means a professional programmer. Some of the features are unfinished. Listr is still a work in progress. This is a learning experience for me, both in writing code and documenting its functionality. Any constructive criticism is welcome.


#!/bin/bash
##listr - automated list creation
##Josh Dean
##2017

##listr
idt=" " ##ident
flowvar=0
activedir=$testdir

##menu_main_cfg
mm1="Setup Wizard"
mm2="Directory Options"
mm3="Number of Options"
mm4=

unset inc
unset list_name
unset current_dir
unset previous_dir
echo "listr - Automated Menu Building"
echo

function menu_main {
##possible to unset all variables?
previous_dir=$current_dir
current_dir=$list_funcname
menu_main_opt=("$mm1" "$mm2" "$mm3" "$mm4" "Quit")
echo "Main Menu"
select opt in "${menu_main_opt[@]}"
do
echo
case $opt in
 ##setup wizard
 "$mm1")
 setup_wizard
 ;;
 ##Directory Options
 "$mm2")
 menu_dir_opts
 ;;
 ##How many options
 "$mm3")
 list_opts
 ;;
 ##
 "$mm4")
 echo "$mm4"
 placehold $srvr
 changeoperation $srvr server "$mm5" srvr
 ;;
 "Quit")
 exit
 ;;
 *) echo invalid option;;
 esac
 echo
 menu_main
 done
}

function list_header {
 echo "Exclude standalone header and footer? (y/n)"
 read ans
 if [ $ans = "y" ]; then
 :
 else
 flow_var=1
 dup_check
 echo "#!/bin/bash" >> $opdir
 echo >> $opdir
 fi
 echo "##$list_name" >> $opdir
 echo "##$list_name" config"" >> $opdir
}

function list_name {
 echo "Enter list name:"
 read list_name
 list_name=${list_name// /_}
 echo "Name set to:"
 echo $list_name
 update_opdir
 list_funcname="menu_""$list_name"
}

function list_opts {
echo "How many options in list?"
opt_num_int_chk
echo
echo "Creating list with $list_opts_num" "options:"
unset list_name_opt
for ((i=1;i<=$list_opts_num;++i)) do
 echo "Option $i:"
 read opt
 echo "$list_name$i=\"$opt\"" >> $opdir
 list_name_opt+=($list_name$i)
done
echo
echo "Include back option? (y/n)"
read ans
if [ $ans = "y" ]; then
 list_name_opt+=("Back")
fi
echo "Include quit option? (y/n)"
read ans
if [ $ans = "y" ]; then
 list_name_opt+=("Quit")
fi
}

function opt_num_int_chk {
read list_opts_num
if ! [[ "$list_opts_num" =~ ^[0-9]+$ ]]; then
 echo "Please enter an integer"
 list_opts
fi
}

function list_array {
echo echo >> $opdir
echo "function "$list_funcname" {" >> $opdir
echo "previous_dir=""$""current_dir" >> $opdir
echo "current_dir=$""$list_funcname" >> $opdir
echo "Enter menu title:"
read menu_title
echo "echo "\"$menu_title\" >> $opdir
echo -n $list_name"_opt" >> $opdir
echo -n "=" >> $opdir
echo -n "(" >> $opdir
tmp=0
for i in ${list_name_opt[@]}; do ##might need another $ for list_name

 if [ "$i" = "Back" ]; then
 echo -n " "\"$i\" >> $opdir
 elif [ "$i" = "Quit" ]; then
 echo -n " "\"$i\" >> $opdir
 else
 if [ "$tmp" -gt "0" ]; then
 echo -n " "\""$"$i\" >> $opdir
 else
 echo -n \""$"$i\" >> $opdir
 tmp=1
 fi
 fi
done
echo ")" >> $opdir
}

function nested_prompt {
echo "Will this list be nested? (y/n)"
read ans
if [ $ans = "y" ]; then
 echo "Name of parent list?:"
 read previous_dir
fi
}

function list_select {
echo "select opt in ""\"""$"{$list_name"_opt[@]"}\""" >> $opdir
echo do >> $opdir
echo "case ""$""opt in" >> $opdir
for i in ${list_name_opt[@]}; do ##might need another $ for list_name
 echo "$idt##"$i >> $opdir
 if [ "$i" = "Back" ]; then
 echo "$idt"\"$i\"")""" >> $opdir
 echo "$idt$idt""$previous_dir" >> $opdir ##need function call
 elif [ "$i" = "Quit" ]; then
 echo "$idt"\"$i\"")""" >> $opdir
 echo "$idt$idt""break" >> $opdir ##need part message
 else
 echo "$idt"\""$"$i\"")""" >> $opdir
 echo echo >> $opdir
 echo "$idt$idt""$i""_func" >> $opdir

 fi
 echo "$idt$idt"";;" >> $opdir
done
echo "$idt""*)" >> $opdir
echo "$idt$idt""echo invalid option;;" >> $opdir
echo "esac" >> $opdir
echo "echo" >>$opdir
echo $current_dir >> $opdir
echo "done" >> $opdir
echo "}" >> $opdir
for i in ${list_name_opt[@]}; do
 if [ "$i" = "Back" ]; then
 :
 elif [ "$i" = "Quit" ]; then
 :
 else
 echo "##$i" >> $opdir
 echo "function ""$i""_func"" {" >> $opdir
 echo "echo ""$""$i" >> $opdir
 echo "echo ""\"This is placeholder text\"" >> $opdir
 echo "}" >> $opdir
 fi
done
if [ $flow_var -gt "0" ]; then
 echo >> $opdir
 echo "##flow" >> $opdir
 echo $list_funcname >> $opdir
 flow_var=0
fi
echo "Output written to $opdir"
echo
echo "Create another list?"
read ans
if [ $ans = "y" ]; then
 list_name_opt+=("Quit")
fi
}

function update_opdir {
opdir="$activedir""/""listr_""$list_name""$inc"
}

function current_opdir {
echo "Operational directory set to $opdir"
}

function update_testdir {
read testdir
}

function update_workdir {
read workdir
}

function current_test_dir {
echo "Test directory set to $testdir"
}

function current_work_dir {
echo "Working directory set to $workdir"
}

function dir_query {
current_test_dir
current_work_dir
current_opdir
}

function check_dirs {
if [ -z "$testdir" ]; then
 echo "The test directory is not set. Set it now."
 update_testdir
fi
current_test_dir
if [ -z "$workdir" ]; then
 echo "The working directory is not set. Set it now."
 update_workdir
fi
current_work_dir
update_opdir
if [ $opdir = "listr_" ]; then
 echo "Operating Path incorrect. Select active directory."
 echo "placeholder for menu"
fi
echo "Operating path set to $opdir""$""list_name"
}

function dup_check {
if [ -a $opdir ]; then
 echo "Do you want to overwrite existing file: $opdir? (y/n)"
 read ans
 if [ $ans = "y" ]; then
 rm $opdir
 else
 echo "Append output to $opdir? (y/n)"
 read ans
 if [ $ans = "y" ]; then
 :
 else
 echo "Use incremental numbering to reconcile with existing file(s)? (y/n)"
 read ans
 if [ $ans = "y" ]; then
 dup_rec
 else
 dup_check
 fi
 fi
 fi
fi
}

function dup_rec {
if [[ -e $opdir ]]; then
 i=1 ##might need to use different variable
 while [[ -e $opdir-$i ]]; do
 let i++
 done
 inc="-$i"
 update_opdir
 echo
 current_opdir
fi
}

function setup_wizard {
echo "$mm1"
list_name
echo
list_header
echo
nested_prompt
echo
list_opts
list_array
echo
list_select
}

##dir_opts
##dir_opts config
dir_opts1="Display Current Paths"
dir_opts2="Set Working Directory"
dir_opts3="Set Test Directory"
dir_opts4="Toggle Active Directory"
dir_opts5="Unset All Directory Variables"

function menu_dir_opts {
dir_opts_opt=("$dir_opts1" "$dir_opts2" "$dir_opts3" "$dir_opts4" "$dir_opts5" "Back" "Quit")
echo "Directory Options"
select opt in "${dir_opts_opt[@]}"
do
case $opt in
 ##dir_opts1
 "$dir_opts1")
 echo
 dir_opts1_func
 ;;
 ##dir_opts2
 "$dir_opts2")
 echo
 dir_opts2_func
 ;;
 ##dir_opts3
 "$dir_opts3")
 echo
 dir_opts3_func
 ;;
 ##dir_opts4
 "$dir_opts4")
 echo
 dir_opts4_func
 ;;
 ##dir_opts5
 "$dir_opts5")
 echo
 dir_opts5_func
 ;;
 ##Back
 "Back")
 menu_main
 ;;
 ##Quit
 "Quit")
 exit
 ;;
 *)
 echo invalid option;;
esac
echo
menu_dir_opts
done
}

function dir_opts1_func {
echo $dir_opts1
dir_query
}

function dir_opts2_func {
echo $dir_opts2
update_workdir
current_work_dir
}

function dir_opts3_func {
echo $dir_opts3
update_testdir
current_test_dir
}

function dir_opts4_func {
echo $dir_opts4
menu_dir_toggle
}

function dir_opts5_func {
echo $dir_opts5
unset workdir
unset testdir
unset activedir
unset

}

##dir_toggle
##dir_toggle config
dir_toggle1="Use Working Directory"
dir_toggle2="Use Test Directory"

function menu_dir_toggle {
dir_toggle_opt=("$dir_toggle1" "$dir_toggle2" "Back" "Quit")
select opt in "${dir_toggle_opt[@]}"
do
case $opt in
 ##Use Working Directory
 "$dir_toggle1")
 echo
 dir_toggle1_func
 ;;
 ##Use Test Directory
 "$dir_toggle2")
 echo
 dir_toggle2_func
 ;;
 ##Back
 "Back")
 menu_dir_opts
 ;;
 ##Quit
 "Quit")
 exit
 ;;
 *)
 echo invalid option;;
esac
update_opdir
echo "Current operational directory:"
current_opdir
echo
echo $current_dir
done
}

function dir_toggle1_func {
echo $dir_toggle1
activedir=$workdir
}

function dir_toggle2_func {
echo $dir_toggle2
activedir=$testdir
}

##flow
check_dirs
echo
menu_main

My systems administration philosophy is that everything should be automated. Nothing should be too sacred to automate. In this way I’m a windfall for employers. My first objective is always automating my own objectives.

The script is called listr and it queries the user about what kind of lists need to be created, then writes them into a text file so they can be implemented in other scripts. The output is editable and scalable, allowing users to easily edit lists that have been written with listr. This is the second “major” script I’ve written and I’m enjoying the logical predictability programming offers. If you put garbage into a program you get garbage out, reliably, every time. If you’re logically consistent with the syntax you can do anything.

The final product is a program that can be transferred across Linux platforms to create lists on the fly. Let’s take a look at how it works.

Once the program was functional I was able to continue writing the additional features. This is congruent with the end goal: Efficiency and functionality.

Let’s take a quick look at the program in action. The program is a command line application so we launch it straight from the Bash console using the source command. This reads the script and runs it.
Since this is a first run, we’ll have to set the test and working directories. Listr can use two directories, “test” and “working”. These could be renamed to anything; the purpose of this functionality is to be able to work in two separate directories if there’s a need to separate the output.

01.png

We’ll set the test and working directories to a demo folder. Setting these variables is part of the first run; once they’re set, the program will launch straight into the main menu on subsequent runs and state the working directory, test directory, and selected operating path.
The main menu consists of 3 options, an additional placeholder option for additional features in the future, and a quit option that terminates the program.
Entering a number brings up the corresponding submenu. Let’s take a look at the setup wizard. This walks through the list creation process, legibly formats the code, and exports it to the operating directory.
The setup wizard begins by asking the user for a list name. For this example we’ll enter “Greetings”. Next, the process asks whether this list needs to exclude a standalone header or footer. If the list is being appended to another program, a header and footer are not needed. For this example, we’ll run the list as a standalone program and choose to include these features.
Next we’re prompted to specify whether the list will be nested or not. This affects the back button. In this instance we are not.
The next step, we specify how many options will be included in the menu. In this case we’ll use 4. Next we’ll input each of the options in the menu. For our “Greetings” example, we’ll input four different greetings.
The next two prompts ask the user if they’d like to include a “Back” and “Quit” option. Since our example isn’t nested, we’ll only include the “Quit” option.
After the navigation options, we’re prompted for a menu title. We’ll keep it simple and just name it “greeting”.

05.png
At this point our list has been created and exported to the operating directory. The user is then asked if they’d like to create another list. The process is repeatable as many times as necessary.

07.png
Let’s take look at the list listr has just created.
The bash header is included because we choose to include standalone headers. The program is ready to run out of the box (almost).
Comments are automatically written to make the code more legible. The configuration menu is provided at the top. By changing the options here, the menu can be tweaked without retooling the whole program. The Greetings_opt array will need to include any new entries, as well as the actual options in the menu. In some situations it would be faster to run the setup wizard and create another menu.
Excuse the excessive echoes that have been written to the file. This seems to be a configuration error on the terminal I’m using. The program still has a fair share of bugs. I thought it would be important to publish this as soon as possible to get experience documenting a program.
The menu function is automatically defined, named, and implemented.
The previous_dir and current_dir variables are a work in progress. The intention is to make the menu titles and back button easier to automate and implement.
The menu itself is formatted out of the box.
For easier editing, the menu options call their associated functions, which are written below the menu function. Out of the box, the options have placeholder text assigned to them. For our greetings example, let’s change each one to the representative greeting. This is simple enough, requiring changes to four lines of the script in this example.
At the bottom, commented under “flow”, is the original function call. This is included because we chose standalone headers and footers. All the functions above are just definitions; this is the actual bit that begins the program. I’m not sure what the formal name for this part of the program would be, so excuse my lexicon if it’s wildly incorrect.
Let’s run our greeting scripts and see how it turned out.
Works like a charm! This list is ready to run as a standalone program or be implemented into another program (without the headers and footers).

12.png

Alongside the setup wizard in the listr main menu are a few additional directory options if you want to change directories after the first run or toggle between the working and test directory. This is still a work in progress.
The future plan for listr might include writing individual components of the list (just the header, just the config, just the options, etc.).
I hope someone can find use for this program. I had a great time writing it. I learned a lot about automating the writing text to files and formatting an export in the syntax of the scripting language. Here’s to hoping for more successful scripting in the future.

 

Below is the greeting menu listr created in this example. Again, please excuse the excess echoes.


#!/bin/bash

##Greetings
##Greetings config
Greetings1="Hello"
Greetings2="Good Morning"
Greetings3="Good Evening"
Greetings4="Sup"
echo
function menu_Greetings {
previous_dir=$current_dir
current_dir=$menu_Greetings
echo "greeting"
Greetings_opt=("$Greetings1" "$Greetings2" "$Greetings3" "$Greetings4" "Quit")
select opt in "${Greetings_opt[@]}"
do
case $opt in
##Greetings1
"$Greetings1")
echo
Greetings1_func
;;
##Greetings2
"$Greetings2")
echo
Greetings2_func
;;
##Greetings3
"$Greetings3")
echo
Greetings3_func
;;
##Greetings4
"$Greetings4")
echo
Greetings4_func
;;
##Quit
"Quit")
break
;;
*)
echo invalid option;;
esac
echo
menu_Greetings
done
}
##Greetings1
function Greetings1_func {
echo $Greetings1
echo "Hello!"
}
##Greetings2
function Greetings2_func {
echo $Greetings2
echo "Good morning!"
}
##Greetings3
function Greetings3_func {
echo $Greetings3
echo "Good evening!"
}
##Greetings4
function Greetings4_func {
echo $Greetings4
echo "Sup, dude!"
}

##flow
menu_Greetings

 

Mapping YouTube Views


YouTube has been an entertainment phenomenon ever since it arrived on the internet in 2005. Its reach is staggering, bringing videos to every corner of the Earth. In every country of the world the word YouTube is synonymous with online entertainment. I’ve always been fascinated by the maps YouTube provided in the “statistics” section of videos; every country in the world would be represented on the most popular videos. It’s a shame YouTube has removed these statistics from public view. Now it’s only possible to see these stats if the uploader makes them available.

youtube analytics

YouTube has a great analytics platform for content creators. It has an interactive map built into the creator studio, which is great for geographic analysis. There are ways to export this data using the API tools YouTube provides. I thought it would be fun to take this data and create a couple of maps of my own. Instead of using the API I acquired the data the old fashioned way: copying and pasting.

I decided to make a map of every country except the United States. Since 95% of my views come from the United States, including it would make the other countries almost indistinguishable on a map.

After copy and pasting the lifetime statistics from the interactive map portion of the YouTube analytics page, I added them to an excel spreadsheet and created a .csv document to add to ArcMap. There was limited parsing to be done. All the data was already organized. I removed the fields I wasn’t going to be using like watch time, average view duration, and average percentage viewed. In the future it might be interesting to map these variables but today I’m just going to focus on raw view numbers.

I’m using the template that I used for my WordPress map. It uses a basemap and a borders shapefile from thematicmapping. This easily allows me to join the csv to the shapefile table and we’re quickly off to the cartographic races.
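For reference, the same table join could be scripted rather than done through the ArcMap interface. The sketch below is my own; the paths are hypothetical, the "NAME" field follows the thematicmapping borders shapefile, the "Country" column is whatever the exported YouTube table calls its country field, and an existing file geodatabase is assumed for the intermediate table.

import arcpy

countries = "H:\\maps\\TM_WORLD_BORDERS-0.3.shp"
views_csv = "H:\\maps\\youtube_views.csv"
workspace = "H:\\maps\\views.gdb"

## copy the csv into a geodatabase table so it can participate in a join
arcpy.TableToTable_conversion(views_csv, workspace, "youtube_views")

## join the view counts to the country borders and save the result
arcpy.MakeFeatureLayer_management(countries, "countries_lyr")
arcpy.AddJoin_management("countries_lyr", "NAME", workspace + "\\youtube_views", "Country")
arcpy.CopyFeatures_management("countries_lyr", "H:\\maps\\youtube_views_joined.shp")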

Compared to the WordPress site, my YouTube channel has a much more impressive geographic reach. Out of the 196 countries on Earth, viewers in 134 of them have clicked on a video hosted on my channel. This is great because it means I’m over halfway to completing my collection of all countries.

The map includes all of the countries except the United States and covers over 11,000 views. I decided to use 10 natural breaks in the colors to add more variation to the map. Experts say that the average human eye can’t differentiate more than 7 colors on a map, so in this case it is purely a design choice.
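
ArcMap computes the natural (Jenks) breaks in the symbology dialog, but if you want to sanity-check the class boundaries outside ArcMap, a quick sketch with the third-party mapclassify package could look like this. It assumes the cleaned CSV from earlier and was not part of the original workflow.

# Hypothetical check of the 10 natural breaks outside ArcMap.
import pandas as pd
import mapclassify  # third-party package: pip install mapclassify

views = pd.read_csv("youtube_views_sansUSA.csv")["Views"]
breaks = mapclassify.NaturalBreaks(views, k=10)
print(breaks.bins)  # upper bound of each of the 10 classes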

YoutubeViews_sansUSA

It looks like I have to carry some business cards with me next time I go to Africa. It’s nice to see such a global reach. It feels good to know that, even for a brief second, my videos were part of someone’s life in the greater global community.

Mapping WordPress Views

It’s been a year since I started writing this blog. Time, as always, seems to fly by. Blogging here has allowed me to develop my writing, communication, and research skills. I thought I’d do something WordPress related to celebrate a year of success and hopefully many more to come: a quick and easy project to map the geographic locations of visitors to this blog over the last year. It’s always interesting to see what countries people are visiting from, and I’m always surprised at the variety.

Data acquisition is simple for this project. WordPress makes statistics available, so it’s not difficult to acquire or parse the data since what’s provided is pretty solid. The one thing that needs to be done is combining the 2016 and 2017 data into one set, since WordPress automatically categorizes visitation statistics by year. Since this blog has only been active for 2016 and 2017, there are only two datasets to combine. This is easily done in a spreadsheet with the WordPress statistics at hand.
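
If you prefer to script the combination instead of doing it in a spreadsheet, a minimal pandas sketch might look like this; the file names and column names are assumptions.

# Hypothetical merge of the two yearly WordPress exports (names assumed).
import pandas as pd

views_2016 = pd.read_csv("wordpress_2016.csv")
views_2017 = pd.read_csv("wordpress_2017.csv")

# Stack the years and sum views per country for the combined map.
combined = (
    pd.concat([views_2016, views_2017])
    .groupby("Country", as_index=False)["Views"]
    .sum()
)
combined.to_csv("wordpress_all.csv", index=False)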

The data suggests growth, with 2017 already overtaking the entirety of 2016 in terms of views. It’s also interesting that 2017 is more geographically diverse, consisting of 49 unique countries compared to 31 in 2016. I decided it would be appropriate to create 3 maps: one for 2016, one for 2017, and one combining the two. This allows the reader to interpret the differences between the years and see the geographic picture as a whole.

I began by exporting the data into a CSV file to be read by ArcMap. I decided on the blank world map boundaries from thematicmapping.org for a basemap. The previously prepared CSV was then joined to the basemap via the “name” field, which reconciles both data tables by the name of each country. Once the data is on the map, it’s over to the quantities symbology to adjust the color scheme and create a choropleth map. I chose to break the data 7 ways and to remove the borders from the countries to give the map a more natural, pastel look.

In layout view the design touches are added. A title was placed at the top and the map was signed. The legend was added, and I used one of the tricks I’ve found useful to format it. First I add the legend with all the default settings and get the positioning correct. After it’s in position, I double check that the data components are correct. Then “Convert to Graphics” is selected to turn the legend into an editable collection of graphic elements. The only downside to this is that the legend no longer reflects changes in the data, so making sure the data is correct before converting is critical. After it’s been converted, selecting “Ungroup” will separate each of the graphical elements, allowing the designer to manipulate each one individually. I personally find this easier and more intuitive to work with. After editing, the elements can be regrouped, and other elements like frames and drop shadows can be added.

Wordpress2016

Full Resolution

Making the 2017 map followed the same methodology.

Wordpress2017

Full Resolution

Combining the two datasets was the only methodological variation when making the final map.

WordpressAll

Full Resolution

At a glance, the trends seem typical. North America is represented in the data, as is Europe. There is an unexpected representation in Asia, which might be due to the several articles that have been written here about China. It’s also neat seeing visitors from South America. The rarest country is New Caledonia, a French territory in the Pacific about 1,000 miles off the coast of eastern Australia.

In the future it would be interesting to create a map that normalizes the number of visitors according to the population of the countries. This would create a map that shows which countries visit at a higher or lower rate per capita. This would illustrate which countries are more drawn to the content on the site.

Here’s to hoping for more geographical variation in the future. Maybe one day all countries will have visited Thoughtworks.

Mapping Malicious Access Attempts

Data provides an illuminating light in the dark world of network security. When conducting computer forensics assessments, the more data available, the better. The difference between being clueless and having a handle on a situation may depend on one critical datapoint that an administrator may or may not have. When the data metrics that accompany malicious activity are missing, performing proper forensics on the situation becomes exponentially more difficult.

Operating a media server in the cloud has taught me a lot about the use and operation of internet facing devices. The server is leased from a third party that operates machines in a data center, and it runs Lubuntu, a distribution of Linux. While I’m not in direct control of the network this server operates on, I do have a lot of leeway in what data can be collected, since the machine is “internet facing,” meaning it connects directly to the WAN and can be interacted with as if it were a standalone server.

If you’ve ever managed an internet facing service, you’ll be immediately familiar with the number of attacks targeted at your machine, seemingly out of the blue. These aren’t always manual attempts to gain access or disrupt services. They are normally automated and persistent, meaning someone only has to designate a target and the botnets and other malicious actors tasked with the heavy lifting begin an attack that operates on its own, persistently, without human interaction.

While learning to operate the server, I found myself face to face with a number of malicious attacks directed at my IP address, seeking to brute force the root password in order to establish an SSH connection to the server. Success would essentially give an attacker complete control of the server, and a strong password is the only thing standing between the vicious world of the internet and the controlled environment of the server. The log provided a number of IP addresses and, like any good geographer, I was eager to put the data on a map to spatially analyze what part of the world these attacks were coming from and to glean some information on who was targeting my media server, an entity with little to no tangible value beyond the equipment itself, and why.

Screenshot_20170527-000900

This log of unauthorized access attempts can be found in many mainstream Linux distributions in the /var/log/auth.log file, and by using the following bash command in the terminal it is possible to count how many malicious attempts were made by each unique IP and rank them by count.

grep "Failed password for" /var/log/auth.log | grep -Po "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" \ | sort | uniq -c


Parsing operations like this allow system administrators to quickly see which IP addresses failed to authenticate and how many times they failed to do so. This is one of the steps that turn raw data into actionable knowledge. By turning raw data into interpretable data, we actively transform its interpretability and, as a result, its usability.
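
The same parsing can be done in Python if you want the counts in a form that is easier to export; here is a minimal sketch, assuming the standard auth.log location and message format.

# Hypothetical Python version of the auth.log parsing pipeline above.
import re
from collections import Counter

ip_pattern = re.compile(r"(\d+\.\d+\.\d+\.\d+)")
counts = Counter()

with open("/var/log/auth.log") as log:
    for line in log:
        if "Failed password for" in line:
            match = ip_pattern.search(line)
            if match:
                counts[match.group(1)] += 1

# Rank offenders by number of failed attempts, worst first.
for ip, attempts in counts.most_common():
    print("{}\t{}".format(attempts, ip))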

This list is easily exported to an Excel spreadsheet where the IPs can be georeferenced using other sources like abuseipdb.com. Using this service I was able to link each IP address and its number of access attempts to the geographic location associated with it at the municipal, state, and national level.

After assigning each IP address a count and a geographic location, I was ready to put the data on a map. Looking over the Excel spreadsheet showed some obvious trends out of the gate: China appears to be the source of the majority of the access attempts. I decided to create 3 maps. The first would be based on the city the attack originated from, with a surrounding graduated symbology expressing the number of attacks originating from that data point. This would let me see at a glance where, globally and spatially, the majority of the attacks originated.

The first map was going to be tricky. Since the geocoding built in to ArcMap requires a subscription to the ArcGIS Online service, I decided to parse my own data. I grouped all the entries, consolidated them by city, then went through and manually entered the coordinates for each one. This is something I’d like to find an easier solution for in the future. When working with coordinates, it’s also important to use matching coordinate systems for all features in ArcMap to avoid geographic inaccuracies.
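
Once the coordinates were entered, getting the points into ArcMap can also be scripted. The sketch below assumes the consolidated table was exported as a CSV with “Lon” and “Lat” columns in WGS84; the paths and field names are assumptions.

# Hypothetical ArcPy sketch for loading the manually entered city coordinates.
import arcpy

attempts_csv = "H:\\Temp\\attack_cities.csv"
wgs84 = arcpy.SpatialReference(4326)  # keep one coordinate system for all features

# Create an in-memory XY event layer, then persist it as a shapefile.
arcpy.MakeXYEventLayer_management(attempts_csv, "Lon", "Lat", "attack_points", wgs84)
arcpy.CopyFeatures_management("attack_points", "H:\\Temp\\attack_points.shp")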

map2b

Full resolution – http://i.imgur.com/sY0c7IJ.jpg

Something I’d like to get better at is reconciling the graduated symbology between the editing frame and the data frame. Sometimes size inaccuracies can throw off the visualization of the data. This is important to consider when working with graduated symbology, like in this case, where the larger symbols are limited to 100 pts.

The second map included just countries of origination, disregarding the cities metric. This choropleth map was quick to create, requiring just a few tweaks in the spreadsheet. This would provide a quick and concise visualization of the geographic national origins of these attacks in a visually interpretable format. This would be appropriate where just including cities in the metric would be too noisy for the reader.

The following is a graphical representation of the unauthorized access attempts on a media server hosted in the cloud, with the IPs resolved to the country of origin. Of the roughly 53,000 access attempts between May 15 and May 17, over 50,000 originated from China.

To create this choropleth map I saved the data into a .csv file and imported it into ArcMap. Then came the georeferencing. This was easily done with a join operation against a basemap that lists all the countries. The blank map shapefile was added twice: one copy for the join and one for the background. During the join operation I removed all the countries I didn’t have a count for, then moved the joined layer to the top so the colorless, empty countries would appear behind the countries with data. This is one thing I continue to love and be fascinated with about ArcMap: the number of ways to accomplish a task. You could use a different methodology for every task and find a new approach each time.

map3

Full resolution – http://i.imgur.com/XyqOexM.png

I decided the last map should be the provinces of China, to better represent where attacks were coming from in this part of the world. The data was already assembled, so I sorted the Excel spreadsheet by the country column and created a new sheet with just the Chinese entries. I was able to refer to the GIS database at Harvard, which I wrote about in an earlier article concerning the ChinaX MOOC they offered. This was reassuring considering my familiarity with the source. The Excel spreadsheet was then consolidated, and a quick join operation to the newly downloaded shapefile is all it took to display the data. A choropleth map would be appropriate for this presentation. I had to double check all the province names to make sure no major provincial changes had been missed by the dataset, considering the shapefile was from 1997.
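
That name check can be automated with a quick comparison between the shapefile attributes and the spreadsheet. The sketch below uses arcpy’s SearchCursor; the paths and field names are assumptions.

# Hypothetical check for province names that won't join to the 1997 shapefile.
import csv
import arcpy

provinces_shp = "H:\\Temp\\china_provinces_1997.shp"

with arcpy.da.SearchCursor(provinces_shp, ["NAME"]) as cursor:
    shp_names = {row[0] for row in cursor}

with open("H:\\Temp\\attack_china.csv") as f:
    csv_names = {row["Province"] for row in csv.DictReader(f)}

# Any province in the spreadsheet that has no match in the shapefile.
print(csv_names - shp_names)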

map4

Full resolution – http://i.imgur.com/ZhJpHLM.png

While the data might suggest that the threats are originating from China, the entities with a low number of connections might be the most dangerous. If someone attempts to connect only once, they might already have a password retrieved by means of a Trojan horse or a password leak. These are the entities that may be worth investigating. All of these entries were listed in the abuseipdb database, so they all had malicious associations. While these threats aren’t persistent, automated attacks, they might suggest an advanced threat or threat actor.

Some of the data retrieval might be geographically inaccurate. While georeferencing IP addresses has come a long way, it’s still not an exact science. Some extra effort might be required to make sure the data is as accurate as possible.

How does this data help? I can turn around and take the most incessant threats and blacklist them on the firewall so they’ll be unable to even attempt to log in. Using this methodology I can begin to build a blacklist of malicious IPs that I can continue adding to in the future. This lets me build a geographic picture of the IPs that might be associated with a malicious entity.
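
As a sketch of how that blacklist might be generated, the snippet below reads the consolidated counts and prints firewall rules for the worst offenders; the threshold, file name, and column names are assumptions, and the rules are printed for review rather than applied automatically.

# Hypothetical blacklist generator for the most incessant offenders.
import csv

THRESHOLD = 100  # minimum failed attempts before an IP gets blacklisted

with open("attack_ips.csv") as f:
    for row in csv.DictReader(f):
        if int(row["Attempts"]) >= THRESHOLD:
            # These lines can be reviewed and then fed to the firewall.
            print("iptables -A INPUT -s {} -j DROP".format(row["IP"]))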

The Internet can be a dangerous place, especially for internet facing devices that aren’t protected by a router or another firewall enabled device. Nothing is impossible to mitigate and understand for a system administrator armed with the correct data. The epistemological beauty of geography is the interdisciplinary applications that can be made with almost anything. Even something as insignificant as failed access attempts can be used to paint a data-rich picture.

Comparing Genetic Results from Ancestry.com, 23andme, Genographic Project

Genetics have always been an interesting subject to me. Genes and the DNA that carries them represent something that can be traced back in time and will be around long after the individual carrying it has passed. These unique identifiers contribute extensively to the lives they create and are the invisible building blocks that make us objectively human. The fact that you can derive geographic and cultural information from genetic makeup is a fascinating contribution to the story of mankind. Certain mutations and DNA markers are geographically unique and allow geneticists and human geographers to pinpoint where, and when, on Earth these genetic differentiations occurred. It’s amazing to consider that where oral tradition, once believed to be the most effective form of relaying information between generations, has failed, science has been able to pick up the reins and accurately surmise information that would likely have been the core component of what was passed between these generations. One’s culture, ancestors, and origin stories were a large part of ancient, familial traditions and still tug at the curiosities of modern humans, as demonstrated by the millions of people who pay to have their DNA tested by the many services that reach into the past and attempt to rekindle the ancient stories locked within the human genome.

My genetic journey began in 2014 when I became curious about my genetic origins. I knew what made me myself physically, psychologically, and culturally, but I wanted to know what kind of influence my ancestry had on the many facets of my being, which of my many eccentricities had been shared, experienced, and influenced by those who came before me, and the characteristics that could theoretically be passed along after me. I wanted to know who I was at the core of my physical being. If you stripped away the cultural, environmental, temporal and geographic factors, I wanted to know what would be left, and this is what I was looking for philosophically when I began looking back. Epistemologically, I’ve always enjoyed history, and the personal element presented by investigating one’s own history created a unique and curious opportunity to consider both the history of the world as a whole and how it intertwined with the history of my ancestors. The “big picture” is composed of many small pictures, and I found myself becoming curious and motivated to discover how the small pictures in my past corroborated with the big picture of humankind.

I began by looking back into the past of both my paternal and maternal lineages, using knowledge I had gained, first-hand accounts from my elders, and the ever expanding resource that is the Internet. I quickly found myself with thousands of entries in my family tree, each entry a unique mystery that was fulfilling to resolve and connected intricately to the mysteries before and after it. Family trees based on recorded history can only take you so far. DNA analysis picks up where that story leaves off.

Ancestry.com

Ancestry.com is a record curating service that allows users to create family trees and cite the entries using an ever-growing collection of historical records. In 2012, Ancestry.com began offering DNA testing, which would allow customers to look into their past in a new way.

In 2014 I decided to try the Ancestry DNA testing kit. At the time I was unsure of my ancestry, the only information in my family lore was limited to Scottish and Irish on my maternal side and almost nothing on my paternal side besides my English surname. The exact story had been lost to time after 8 generations in the New World.

After 6 weeks, I got my results:

original
My Ancestry.com results

Going in I had no expectations, so I wasn’t particularly surprised by any of the results. The majority Scandinavian result was interesting considering there was no story of it in my extended family lore. I figure these results put a lot of weight on haplogroups, considering my paternal haplogroup is Scandinavian, which I found out later in the 23andMe testing.

 

I went on to have the rest of my family tested to see how the results stack up. I tested my mother, my father, and my maternal grandmother. My reasoning was scientific, in that the results could be compared and the accuracy of the test could be assessed. It might make a difference that I was working “backwards,” submitting myself first, then my parents, then my grandparents. Typically you work your way “down” the genetic line when considering the components and relations of someone’s DNA. It’s not impossible to do, but I wonder how the results would have changed if I had my grandmother tested, then my parents, then myself.

Paternal_ancestry
Paternal DNA
maternal
Maternal DNA

Looking at the results, some questions are raised and the general gist of which ethnicities came from which parent is established. My original question was how my father can have 58% DNA from Great Britain while I have only 11%. As I understood it, I should have at least half, especially considering my mother had 17%. I’m not a geneticist, so my conjecture is likely not accurate. It’s also important to remember that these numbers are weighted on certain genetic markers and not necessarily indicative of pure geographic fact; they are comparisons to native populations that still live in the area today. Looking at the results it’s also clear that practically all of my Irish and Scandinavian influence comes from my mother, despite my paternal haplogroup being Nordic. It was also interesting to see the amount of Iberian Peninsula markers in my mother’s DNA, something that has never been explained in family lore. My father’s Finland/Northwest Russia markers were interesting and likely corroborate with the European Jewish marker.

materal_grandmother
Maternal Grandmother DNA

We move farther up the genetic ladder with my maternal grandmother’s analysis. It is plain to see where the Irish influence comes from. The reduced Scandinavian numbers suggest that my maternal grandfather brought some Nordic genes to the mix. Again, there is the Iberian Peninsula influence, which continues to be a mystery. This might be isolated to my maternal grandmother’s side, which gives me an indication of where to start looking for this influence. It seems I share the Middle East marker with my grandmother, showing that my mother could have been a carrier for this marker although it did not express itself in her results. It is possible it might have been expressed as the Caucasus marker.

Each of the three tests offers supplementary details alongside the DNA test. What I like about the Ancestry.com test in particular is the ability to export your entire genome in a plaintext file. This makes it possible to submit this textfile for additional testing, use it to look for markers in smaller curation groups, and file away for safe keeping if ever needed.

Genetic_Communities
Genetic Communities

Another component of the Ancestry.com test that has been introduced rather recently is the Genetic Communities analysis. This uses your data alongside thousands of other users’ data and records to build a profile of the “when and where” all these users might have in common. This feature has been rolled out since I took my test, which suggests that Ancestry is continuously introducing new features that can assist its customers’ research inside or outside the community.

These particular results corroborate with my research, adding another source I can use when retelling my story. It also serves as kind of an intermediary between the ethnicity estimate, which takes into consideration factors that are thousands of years old, and the personal research, which goes back several generations. If your immediate family history is limited, this will provide some good food for thought when moving forward with research.

DNA Matches
DNA Matches

The “DNA Matches” component of the results lets you see who shares your DNA. The service automatically estimates the degree of the relationship (1st, 2nd, 3rd cousin, etc.) and allows you to communicate with other members if you are so inclined. The results are numerous, and the list is continually updated as more people join the service and the database expands. It has functionality for adding people directly to your family tree, which streamlines the process. It directly integrates with a feature that was introduced in 2015:

DNA Circles
DNA Circles

DNA Circles automatically looks through the family tree information of those who are confirmed to be related to you through DNA testing and looks for similarities. It then automatically alerts you to “circles” you may be a part of. This takes the grunt work out of manually looking through your relatives’ trees for names you recognize. It also adds another layer of corroboration that can support your independent research.

I was happy with the Ancestry.com test, happy enough to do it 4 times. Each individual test is $99 and often goes on sale for holidays like Mother’s Day, Father’s Day and Christmas. My curiosity wasn’t satiated, though, and led me to a product that had been around since 2006.

23andMe.com

23andMe differs from Ancestry.com in that its focus is medical while Ancestry.com is focused on, well, ancestry. The 23andMe test includes a traditional ethnicity test that uses a unique database and marker system, likely producing different results than the Ancestry.com test, allowing for a “second opinion” of some sort. I was personally curious to see how my results would differ using the 23andMe system compared to the Ancestry system. It is notably more expensive than Ancestry’s $99 test at a whopping $199 for the full feature test which includes all of the medical tests and results. 23andMe provides just the ethnicity testing for $99.

23andMe

The 23andMe ancestry reports add some depth compared to the Ancestry.com reports. In addition to the ancestry composition which is a percentage breakdown of your ethnic makeup, it offers a look into your paternal and maternal haplogroups as well as the amount of Neanderthal markers that are present in your DNA. It also provides the DNA matches you have with people who use the service and offers a social platform to communicate with your relatives.

imageedit_5_6293939080

The ancestry composition is interesting. All of the reports of the site give you a visual representation of your DNA. When comparing matches you’re able to see exactly which components of your DNA are shared and, pictured above, you’re able to see which chromosome segments are associated with which portion of your ancestral composition.

I was interested to see the British and Irish portion of my results at the top. This corroborates with my paternal Ancestry.com test, which was overwhelmingly British. The French and German levels (Western Europe on Ancestry) were similar, adding another layer of confirmation to the results. Scandinavian, however, the majority of my reported ancestry on Ancestry.com, was relegated to 0.5%, suggesting there is a dramatic shift in the testing methodologies between the two sites. The East Asian results might be responsible for the slim Native American results on the Ancestry.com test. It is said that Genghis Khan has millions of descendants alive today; it might be that I’m one of them, considering the Mongolian portion of the results. It’s also interesting to note the <0.1% of Iberian noted in the results. This is a major departure from what was suggested by Ancestry.com; the 23andMe testing methodology was not confident in the presence of Iberian markers.

The reported paternal haplogroup of I-M253 is likely responsible for the 0.5% Scandinavian result.

HG_I1_europa
I-M253 haplogroup

The map above shows the likely geography where the I-M253 haplogroup can be found. The paternal haplogroup traces your father’s line, and all of his male ancestors’ lines, all the way back until the mutation originated. My patrilineal ancestry traces back to the area in the map above, and my working theory, considering all the British components, is Viking ancestry. Since there is little Scandinavian in these results and the paternal haplogroup suggests a male component in the incorporation of these genetic markers, I believe a Viking raid might be responsible for the introduction of this DNA to the British Isles. If the introduction of this DNA were due to geographic and cultural drift, I believe there would be more than 0.5%, because that type of DNA exchange wouldn’t be fleeting and would include more admixture over time, resulting in a larger number.

Maternal haplogroups are the other side of the coin, tracing an individual’s matrilineal lineage back through all the female ancestors to a specific mutation. In my case the mutation is U3a1. Everyone alive on Earth today can trace their maternal DNA beyond the haplogroup mutations back to one woman, Mitochondrial Eve. The U3a1 haplogroup is still in its infancy research-wise; as more people with the haplogroup are tested, the more robust the results will become. It is likely the marker that suggested the Caucasus result in my Ancestry.com maternal results.

maternalhaplogroup

I really like the scientific depth 23andMe goes into when presenting the haplogroup results. This is the final leg of the origin story. You can trace this haplogroup tree all the way back to Mitochondrial Eve, completing the ultimate beginnings of the origin story. 23andMe incorporates scientific citations into the result pages, allowing inquisitive users easy access to the original research.

The staple of the 23andMe service has to be the health reports. They are constantly rolling out new reports, as evidenced by the email alerts I receive every few months. There is also a report that provides insight on traits you’re likely to have. These two reports contain things like your likelihood of being able to smell asparagus, eye color, the chance of having a widow’s peak, and your finger length ratio, to name a few.

Traits

The health reports include your carrier status for, at this time, 42 known gene and disease associations. Genetics is still in its relative infancy. The human genome was completely sequenced only in 2003. That has left a short 14 years for companies like 23andMe to research the association between genes and disease. Unfortunately, not every disease has a known genetic marker. It would be nice to know (maybe) if you’re going to develop cancer based on a DNA test but it’s just not possible yet. The following are a few examples of what the report provides.

Carrier_Status

There is another section titled “Genetic Health Risks” which lets the tester know if they’re at risk for certain ailments that are associated with, but not directly caused or carried by, genetic markers. This section is philosophically interesting because it asks you if you’re sure that you want to see these results. Some day we might have the ability to forecast exactly what our cause of death may be, what degenerative diseases we’ll encounter when we age, or what types of cancer we’ll have to battle with. The choice to be ignorant about these things is a right some individuals might want to retain, and this expression of “are you sure you want to see these results” is a peek into how this information might be presented in the future. On one hand, someone might not want to go through life awaiting a debilitation that will certainly come, choosing to be ignorant of the fact instead. However, others might choose to know what is in store for them so they can properly prepare. I’m the type of person that wants to be equipped with all the information possible. There are currently 4 of these health risks reported by 23andMe.

Genetic Health Risk

My results indicate that because of an ε4 variant in the APOE gene, a marker associated with an increased risk for Alzheimer’s disease, I’m at a slightly increased risk for Alzheimer’s. This is only one part of the puzzle, and the work is still in the preliminary phases, so nothing is certain. We’re definitely not yet in the age where bioinformatics can accurately predict what will happen with 100% certainty. The research goes on to state that European men with this variant have a 20-23% likelihood of developing Alzheimer’s by 85. I’m a betting man, though, and I’ll be taking my chances.

This is a good time to bring up the privacy considerations associated with these tests. When you agree to have your DNA tested by what are essentially genetic data brokers, you assume some risk. These organizations own the results of your DNA tests, and this allows them to build their databases, which in turn allows them to more accurately test other members of the site, research additional genetic markers, and connect you with your relatives. They own this data despite whatever inventions and data brokerage opportunities may come about in the future. It’s impossible to know for sure what amazing and/or privacy violating applications this data will be eligible for in the future. It is likely that the results of this DNA test will outlive the person being tested. The medical information these tests produce is in an interesting legal limbo. Could insurance companies use this information when providing or quoting you coverage? If I’m at risk for something according to my DNA, will my premiums be adjusted to account for this? What price would it take for 23andMe to disclose this information? What laws are in the works to protect citizens from this predatory data brokerage? Are you sure you want to hand over that DNA?

Finally, 23andMe provides “Wellness Reports” for things that take into consideration a number of genetic factors. These include things like lactose intolerance and sleep movement.

Wellness Reports

Overall, 23andMe offers a robust ancestry breakdown and is the leading edge for consumer DNA health reports. At $199 it’s the priciest of the bunch and, personally, I think it’s worth the cost for the information provided. Being able to compare the Ancestry.com information was worth it for me. The health reports were an added benefit. This is a product that keeps working for you, as well. It is constantly being updated as more research is being completed.

The Genographic Project

The last test was provided by National Geographic in the form of The Genographic Project. Its cost is middle-of-the-road at $149.99, and it provides a few unique features alongside the traditional ancestry composition breakdown. As of May 2017 there are 834,322 participants in the database. I took this test in 2017 and was familiar with the results of the other two tests. I wasn’t exactly sure what to expect.

Screenshot from 2017-05-16 17-08-13
My Genographic Project results

The regional results for this test were very broad, and I assume the intention was to be accurate rather than specific. I can understand the Northwestern Europe portion of the results; everything seems to fall in a range of reasonable expectation. The Southwestern Europe portion is what threw me for a loop. This hearkens back to the Iberian indicators that were present in the Ancestry.com results and mysteriously absent from the 23andMe results. Some marker is likely being interpreted differently by the tests. The Italy and Southern Europe results are also larger than expected. The Northeastern Europe portion likely refers to the Scandinavian element, likely due to the paternal haplogroup. The amount is generous compared to the 23andMe results and more conservative compared to the Ancestry.com results. This value is likely closer to the <5% side of the spectrum, considering 2 of the 3 tests have placed it there. It’s interesting to see Eastern Europe mentioned, likely of paternal origin considering the Finnish and Russian influence. My impression is that this database is younger and smaller than the other two.

The Genographic Project analysis included what I’d consider to be my favorite part of the results: a look at how long ago you shared a common ancestor with a notable figure in history.

Genius

My closest famous cousin is Leo Tolstoy of Russia. Since the connection is on the paternal side, this is likely where the Eastern European element of the results comes from. Not many scientific details are provided, though. If an individual wanted to connect the dots it would have to be done manually, and there are quite a few generations of family tree to fill out between now and 12,000 years ago. This connection is quite broad; our most recent common ancestor could have existed before people were writing cuneiform. Hopefully talent is genetic, because I could use some literary inspiration. Another notable entry is Genghis Khan.

Haplogroups are included in the analysis. The Genographic analysis seems to include additional information compared to 23andMe.

haplogroup

In addition to the haplogroup itself, the results include the number of people that share the haplogroup. Compared to the result on 23andMe, the Genographic maternal result includes an additional demarcation with U3a1b compared to just U3a1, suggesting the haplogroup testing is more in depth. The paternal line remains the same.

The final element of the Genographic analysis is the Neanderthal markers.

Neaderthal

This number corroborates with the 23andMe numbers in that both are above the average markers present in individuals in the dataset. Not much else can be inferred from this at this time.

I felt the Genographic test was lacking compared to the other two services. It seems to be in the same vein as 23andMe but doesn’t provide the health data. The price could definitely be lower. I feel like, for what it offers, the price would be more appropriate at $79 rather than $150. Hopefully the service improves in the future and provides additional insight. As it stands now, you are paying to be in a database. That isn’t necessarily a bad thing, because National Geographic is likely to put this data to good use.

It was a close call between 23andMe and Ancestry.com. They both have carved out a sizable niche. 23andMe has the health elements cornered and Ancestry has the robust backend for providing elements relative to family tree construction. The Genographic project didn’t provide the depth of data I have been seeing from these more mature products.

There is definitely some scientific methodology to this stuff. I saw some common patterns between all three tests. I’m not sure where my research should go at this point regarding this project. It’s likely that I’ll expand on the family tree and see what kind of interesting conclusions I can draw from it. There are still stories to be discovered in the recent past. As for the extended past, it’s all up to the researchers at this point.

Future endeavors could involve getting additional 23andMe testing done for my parents. I still have results that I could compare between the services. There’s also an interesting service called Teloyears which is focused around the health of telomeres, an element of DNA. It would probably be comparable to 23andMe and at $89 it wouldn’t be a huge loss. The science is still young and I guess what I’m trying to do is get in as many of these databases as possible so that my data and DNA is constantly being engaged with new research.

Since the DNA data has more longevity than the person providing it, who knows what could happen. I’d like to think something I provide might be able to help humanity long into the future. In the meantime, I’ll settle for mailing spit tubes across the United States for a quick laugh.