Writing a RAID Calculator in Python

RAIDr is a RAID calculator written in Python. It accepts input from the user and calculates how a given set of hard drives will be allocated across different RAID configurations. This is the first program I ever wrote and the project that got me interested in programming. It’s not the most efficient, and there are alternate ways to approach this problem, but I’m happy with how the product turned out. It is still incomplete, but hopefully someone can find it useful. The code is commented with thoughts on how it should function and things that need to be done. I’m not a professional Python programmer, and my methodology might not be completely “pythonic”, but this was a great project for me to gain exposure to programming and syntactical logic. Any constructive criticism is welcome.

## Josh Dean
## RAIDr
## Created: 2/14/2017
## Last Edit: 3/21/2017
## Known bugs:

## Global Declarations

hddnumvar = 0
hddsizevar = 0
raidvar = 0
hddwritevar = 0 ## used to mitigate reference error in RAID 10 calculation

## Functions

def hdd_num():
	global hddnumvar
	print ("\nHow many drives are in the array?") ## eventual formatting errors will come from here
	hddnumvar = input() ## necessary if variable is global?
	if hddnumvar == 1:
		print ("Error: Can't create a RAID with 1 disk.")
		hdd_num()
	elif hddnumvar > 1:
		print hddnumvar, "drives in the array"
		print "----------------------- \n"
	else:
		print ("I don't know what you entered but it's incorrect.")

def hdd_size(): ##needs error parsing
	global hddsizevar
	print ("What is the capacity of the drives? (gigabyte)")
	hddsizevar = input() ## possible to use line break with input
	print hddsizevar, "raw GiB per disk"
	print "----------------------- \n"
	print("%s drives in the array of %s GiB each.") % (hddnumvar, hddsizevar) ##there was a return value here, implication, seems to be hanging here with a syntax error?
	##removed the % format for something else, seems to be working single quotations critical for functional syntax, fixed it by including the arguments in parathesis
	raid_prompt()

def raid_prompt(): ##update this to reflect actual raid configurations, calls raid_calculation, all edits and calls should start here
	print ("\n1 - RAID 0")
	print ("2 - RAID 1")
	print ("3 - RAID 5")
	print ("4 - RAID 5E")
	print ("5 - RAID 5EE")
	print ("6 - RAID 6")
	print ("7 - RAID 10")
	print ("8 - RAID 50")
	print ("9 - RAID 60 \n")
	raidvar = input("What raid configuration? \n")
	raid_calculation(raidvar)

def raid_calculation(raidvar): ## just handles the menu
	if raidvar == 1:
		hddtotal = hddsizevar * hddnumvar ## variables need to go first
		print "\n-----------------------" ## /n doesn't need a space to seperate, bad formatting, best to put this in front
		print ("RAID 0 - Striped Volume")
		print hddnumvar, "drives in the array"
		print hddsizevar, "raw GiB in the array per disk"
		print "%s raw GiB in the array total" % hddtotal
		print "Total of", hddnumvar * hddsizevar, "GiB in the RAID array." ## this need alternative wording throughout the program
		print "%s times write speed" % hddnumvar ## Can I put these two prints on one line? Multiple % variables?
		print "%s times read speed" % hddnumvar
		print "No redundancy"
		print "No hot spare"
		print "----------------------- \n"
	elif raidvar == 2:
		print "\n-----------------------"
		print ("RAID 1 - Mirrored Volume")
		print hddnumvar, "drives in the array"
		print hddsizevar, "raw GiB per disk"
		print "Total of", hddsizevar, "GiB in the array."
		print "%s times read speed" % hddnumvar
		print "No write speed increase"
		hddredunvar = hddnumvar - 1
		print "%s disk redundancy" % hddredunvar
		print "No hot spare"
		print "----------------------- \n"
	elif raidvar == 3:
		if hddnumvar < 3:
			print "\nYou need at least 3 disks to utilize Raid 5"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 5 - Parity")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 1) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 1
			print "%s times read speed" % hddreadvar
			print "No write speed increase"
			print "1 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar == 4:
		if hddnumvar < 4:
			print "\nYou need at least 4 disks to utilize RAID 5E\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 5E - Parity + Spare")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 1
			print "%s times read speed" % hddreadvar
			print "No write speed increase"
			print "1 disk redundancy"
			print "1 hot spare"
			print "----------------------- \n"
	elif raidvar == 5:
		if hddnumvar < 4:
			print "\nYou need at least 4 disks to utilize RAID 5EE\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 5EE - Parity + Spare")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 2
			print "%s times read speed" % hddreadvar
			print "No write speed increase"
			print "1 disk redundancy"
			print "2 hot spare"
			print "----------------------- \n"
	elif raidvar == 6:
		if hddnumvar < 4:
			print "\nYou need at least 4 disks to utilize RAID 6\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 6 - Double Parity")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 2
			print "%s times read speed" % hddreadvar
			print "No write speed increase"
			print "2 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar == 7:
		if hddnumvar < 4:
			print "\nYou need at least 4 disks to utilize RAID 10\n"
			disk_num_prompt()
		elif (hddnumvar % 2 == 1):
			print "\nYou need an even number of disks to utilize RAID 10\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 10 - Stripe + Mirror")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar / 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar / 2 ## actual write variable calculation
			print "%s times read speed" % hddnumvar
			print "%s write speed increase" % hddreadvar
			print "At least 1 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar == 8: ## bookmark, need formulas
		if hddnumvar < 6:
			print "\nYou need at least 6 disks to utilize RAID 50\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 50 - Parity + Stripe")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 2
			##print "%s times read speed" % hddreadvar
			##print "No write speed increase" # Although overall read/write performance is highly dependent on a number of factors, RAID 50 should provide better write performance than RAID 5 alone.
			print "2 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar == 9:
		if hddnumvar < 8:
			print "\nYou need at least 8 disks to utilize RAID 60\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 60 - Double Parity + Stripe") ## capacity and speed formulas still needed
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "At least 2 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar > 9:
		print ("Error: Please select a number between 1 and 9")
		raid_prompt()
	elif raidvar == 0: ## additional error parsing required here
		print ("Error: Please select a number between 1 and 9")
		menu_prompt() ## ubiquitous for all loop items that aren't errors

def disk_num_prompt(): ## this will eventually need to except arguments that are context sensitive for raid type and disk requirements, perhaps handle this is the raid_calculator function
	global hddnumvar
	print "Adjust number of disks?"
	print "1 - Yes"
	print "2 - No"
	disknummenuvar = input()
	if disknummenuvar == 1:
		hddnumvar = input("\nHow many drives are in the array? \n")
		if hddnumvar == 1:
			print "Error: Can't create a RAID with 1 disk."
			hdd_num()
		elif hddnumvar > 1:
			print "\nUpdated"
			print hddnumvar, "drives in the array" ## displays once for every loop, hdd_num_input for mitigation
			raid_prompt()
		else:
			print ("I don't know what you entered but it's incorrect.")
			disk_num_prompt()
	elif disknummenuvar == 2:
		raid_prompt()
	else:
		print("I don't know what you entered but it's incorrect.")
		disk_num_prompt()

#below is the menu for the end of the selected operations
def menu_prompt(): ## need additional option to go to GiB to GB converter?
	print "1 - RAID menu"
	print "2 - Quit"
	print "3 - Start Over"
	menu = input()
	if menu == 1:
		raid_prompt() ## looping, quit() function is ending script, will need revision
	elif menu == 2:
		print "Cya"
		quit()
	elif menu == 3:
		start()
	elif menu == 0:
		print "Error: Please select 1, 2, or 3 \n"
		menu_prompt()
	elif menu > 3:
		print "Error: Please select 1, 2, or 3 \n"
		menu_prompt()
	else:
		print "quit fucking around" ##formatting

def data_transfer_jumpoff(): ##BOOKMARK
	print "What is the transfer speed? Gigabytes, please"
	transfervar = input()
	print"What denominator of data size?"
	print"1 - byte"
	print"2 - kilobyte"
	print"3 - megabyte"
	print"4 - gigiabyte"
	print"5 - terabyte"
	transferunit = input()
	print"How much data?"
	transferamount = input()

## Start Prompt, this needs to be expanded upon
def start(): ##easier way to reset all these variables?
	startmenu = 0
	raid_var = 0 ## should be an inline solution for this in its own function, it just works
	hddnumvar = 0
	hddsizevar = 0
	raidvar = 0
	print("\nChoose an operation") ## line break might cause formatting errors look here first
	print "1 - RAID calculator"
	print "2 - Data Transfer Calculator"
	startmenu = input()
	if startmenu == 1:
		hdd_num() ## these need to be called in a more functional manner
		hdd_size()
	elif startmenu == 2:
		data_transfer_jumpoff()
	else:
		print "Not supported\n" ## will require edit
		start()

#main scripting

start()

 

The operation of the program begins by calling the start function. I put this function call at the bottom of the script so it would be easily accessible; start() is the last function defined before the initial call. From this menu the user is asked which of the two currently implemented operations they wish to perform: the RAID calculator or the Data Transfer Calculator. The Data Transfer Calculator is still a work in progress.

system-hdd-raid.jpg

 

When the RAID calculator is selected, the user is asked for the number of hard drives and their capacity through two functions: hdd_num() and hdd_size(). These functions are called several times during a session of the program, so I thought it was appropriate to make them their own functions. They read input from the user and set the appropriate variables for use during the calculation.

Next, the available RAID formats are listed, and the user chooses which calculation to perform on the previously set variables. In this version of the code, the RAID 0 through RAID 10 options work fine, but the functions for RAID 50 and 60 are missing their capacity calculations since the formulas are not as straightforward. Once the selection is made, the results of the calculations are displayed.
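For reference, the missing capacity math mostly comes down to how many RAID 5 or RAID 6 spans the disks are striped across. Below is a minimal sketch of how those calculations could slot into RAIDr; the helper names and the idea of asking the user for a span count are my own assumptions, not something the current code does.

## Hypothetical helpers for the missing RAID 50/60 capacity math.
## "spans" is the number of RAID 5 (or RAID 6) groups striped together;
## RAIDr would have to ask the user for it, since it cannot be derived
## from the disk count alone.
def raid50_capacity(disks, size_gib, spans):
	## each RAID 5 span loses one disk to parity
	return (disks - spans) * size_gib

def raid60_capacity(disks, size_gib, spans):
	## each RAID 6 span loses two disks to parity
	return (disks - 2 * spans) * size_gib

print("RAID 50, 8 x 1000 GiB in 2 spans: %s GiB usable" % raid50_capacity(8, 1000, 2))
print("RAID 60, 8 x 1000 GiB in 2 spans: %s GiB usable" % raid60_capacity(8, 1000, 2))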

At the end of the operation, users are presented with several options. They can change the variables and recalculate, or change the RAID calculation itself. The main menu can also be called to perform a data transfer calculation. In the future, it might be beneficial to pass the size of the array along and calculate the transfer time by just asking the user for the connection speed; this data could be appended to the RAID data. It might also be beneficial to include a memory function that remembers specific RAID configurations and writes them to a text file that can be loaded on subsequent runs.
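A memory feature along those lines could be as simple as writing the three globals to a text file and reading them back on a later run. A rough sketch, assuming a file name of raidr_config.txt; the helpers and file format are hypothetical, not part of RAIDr today.

## Hypothetical save/load helpers for RAIDr's globals.
def save_config(filename="raidr_config.txt"):
	f = open(filename, "w")
	f.write("%s,%s,%s\n" % (hddnumvar, hddsizevar, raidvar))
	f.close()

def load_config(filename="raidr_config.txt"):
	global hddnumvar, hddsizevar, raidvar
	f = open(filename, "r")
	hddnumvar, hddsizevar, raidvar = [int(x) for x in f.readline().split(",")]
	f.close()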

Another function that might be useful is the reconciliation of GiB values and GB values. This would help if users are using an NTFS file system. It might also be useful to include other filesystem types in the calculations to get the most accurate numbers possible and maximum compatibility.
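The conversion itself is just the ratio between 1024^3 bytes per GiB and 1000^3 bytes per GB, so a helper for it would be short. A sketch (the function names are mine):

## GiB <-> GB conversion helpers; for example, a 4000 GB drive holds about 3725.29 GiB.
def gib_to_gb(gib):
	return gib * (1024.0 ** 3) / (1000.0 ** 3)

def gb_to_gib(gb):
	return gb * (1000.0 ** 3) / (1024.0 ** 3)

print("%.2f GiB" % gb_to_gib(4000))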

Again, this was fun to make and I find myself using it from time to time. There is still a lot of work to do before the program can stand on its own. Taking user input comes with an interesting set of problems that could allow certain inputs to change the behavior of the program. If the user isn’t intentionally trying to break the program this shouldn’t be an issue; the instructions and commands are very clear when user input is necessary. There is also some formulaic work to be done on the two newest RAID formats.

Python was a great language for me to grasp the beginning intricacies of programming, and I feel capable of building even more intricate programs. Combining the operating structure of something like RAIDr with the GIS functions illustrated below would allow easy semi-automatic scripting of tasks. The sky is once again the limit.

Mapping Botnets

A botnet is a collection of compromised computers controlled by an individual or a group of malicious actors. You may have heard the term “owned” thrown around online or in gaming communities. The origin of the word comes from taking control of another computer. If I get you to run a Trojan that places a backdoor on your computer, leverage this backdoor to escalate my privileges on the system, and then de-escalate your privileges, you no longer own your computer. Your computer is “owned” and administered by someone else, likely remotely. A botnet is just a collection of these compromised machines working in tandem: commanding and controlling each other, pushing updates, mining cryptocurrency, running portscans, launching DDoS attacks, proxying an attacker’s connection, and hosting payloads for attacks.

The larger the botnet the better, but other features like connection speed, hardware, and topology on a network ultimately define a computer’s usefulness in a botnet. Don’t get me wrong, a hacker is not going to be picky about the computers that are added. There is a job for every piece of hardware.

The larger the botnet, the more attention it draws. Larger botnets can also be leased out on black markets for attacks and other malicious activity. There is a sweet spot where a botnet is powerful enough to be leveraged but low-key enough to fly under the radar. Flying under the radar is something Mirai, a botnet from late 2016, did relatively well. Mirai took control of IoT (Internet of Things) devices with weak passwords. These devices included TV boxes, closed-circuit TV equipment, home thermostats and other “things”. These devices are set up to run without administration, so once they were owned by an attacker, they were likely to remain owned and under the radar. vDos, a DDoS botnet for hire, holds the record for the largest DDoS botnet. The owners were arrested in Israel in August 2017.

I’ve been dealing with what I suspect to be a botnet on my home network. I got lucky the other day after installing a home firewall. After blocking a suspect connection I was swarmed with thousands of attempted sessions from all over the world. My working theory is that this is some botnet using P2P networking for its command and control infrastructure, and that it was trying to see where the computer it had lost contact with went. I was able to export this five-minute period of connections to a CSV file and plot it in ArcMap. The map below is what was produced.
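Getting the CSV into ArcMap is mostly a matter of two geoprocessing calls. The sketch below shows the general idea; the paths and the longitude/latitude column names are placeholders, since the exact fields depend on which Untangle report was exported.

import arcpy

## placeholder paths and field names for the exported session data
sessions_csv = "H:\\botnet\\sessions_export.csv"
out_shp = "H:\\botnet\\sessions_points.shp"
wgs84 = arcpy.SpatialReference(4326)

## build an in-memory XY event layer from the CSV, then persist it as a shapefile
arcpy.MakeXYEventLayer_management(sessions_csv, "longitude", "latitude", "sessions_lyr", wgs84)
arcpy.CopyFeatures_management("sessions_lyr", out_shp)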

 

botnet activity 1.png

I’m a firm believer that every problem should be approached, at least hypothetically, through a geographic perspective. By putting this data on a map, an additional perspective is provided that can be analyzed. Looking at this map for the first time was also surprisingly emotional for me. I have been chasing this ghost through the wire since December 2016 and, through the geographic perspective, was finally able to see and size up the possible culprit.

I had to filter out the United States from the dataset because I was running an upload to an Amazon web service which would have added inconsequential coordinates to the United States, skewing the data. This data would later be parsed and included.

Immediately I was drawn to the huge cluster in Europe. If this is truly the botnet I’ve been looking for, Europe would be a good place to start looking. There were 7000 sessions used in the dataset. I’m grateful that Untangle firewall includes longitude and latitude coordinates in the data it produces. This made the data migration easy and painless.

I got lucky again two weeks later when I got another swarm of sessions from what I assume to be the same botnet. This was, again, after I terminated a suspect connection, suggesting that this experiment is repeatable which would provide an avenue for reliable data collection. I then took to the new ArcGIS Pro 2.0 to plot some more maps. With 2 sets of data, analysis could be taken to the next level through comparison.

 

2017-08-15.png
Full Resolution

 

First I have to say that this new ArcGIS interface is beautiful. It’s reminiscent of Microsoft Office due to the ribbon toolbar layout. I found the adjustment period quick and the capability expanded compared to earlier versions and standalone ArcMap. After using ArcMap I was surprised to see how responsive and non-frustrating this was to use. I ran ArcMap on a bleeding-edge system with 16GB of RAM and saw substantial slowdown. I was able to run this suite on an old OptiPlex system with 4GB of RAM with no noticeable slowdown. It is truly a pleasure to work with.

 

botnet_activity_8_13_17_small.png
Full Resolution

 

Using the second set of data I was able to produce the map above. I went ahead and created a huge resolution image so I could see the finer geographic details of the entities involved. This dataset includes the United States because I wasn’t running any background processes at the time the data was collected. I can safely assume this map represents only suspected botnet connections. I was glad to see a similar distribution, with Europe continuing to produce the majority of the connections. The real fun begins when we combine these two datasets but first let’s take a moment to look over the patterns in the above map.

Just at a glance we can see there is a disproportionate number of connections originating in Europe. There seem to be four discernible areas of concentration in Europe: the United Kingdom, the Benelux region, the Balkans, and Moscow. Looking at the United States we see a majority of connections coming from the Northeast, and across the Saint Lawrence in Canada. Florida is represented, as are the Bay Area and Los Angeles. Vancouver, Canada seems to have a strong representation. Connections in South America are concentrated along the mouth of the Río de la Plata, where the major population centers are, and the coast of Brazil. A lot of South American tech operations happen in this region. If there were compromised computers on the continent, this would be an appropriate area to find them.

China seems to be underrepresented. The last network security maps I made were overwhelmingly populated by Chinese IPs. This map seems to feature only Beijing of the three major coastal cities. The Korean peninsula has a strong representation. Central and Southern Asia are not strongly represented except for India, which, like China, seems underrepresented considering the population and the number of internet-connected devices in the country.

It turns out Singapore is a large player in the network. However, it’s not inherently apparent given Singapore’s small footprint. These point maps don’t properly represent the volume of connections for some areas where many connections originate from a small area. By using heatmaps we can combine the spatial and volume elements in an interesting way.

Next we’ll look at the combination of these two point databases.

 

botnet_activity_both_days_7_31_17_top_small.png

 

I included the lower resolution map above so the points could be easily seen. A level of detail is lost but it allows it to be easily embedded in resolution sensitive media like this webpage.

The idea here was that, since a majority of the points overlap, a comparison of changes could be made between this two week period. I parsed the United States data from the first dataset so it could be included and compared. By focusing on what dataset is layered on top, we can infer which computers were removed from the botnet, either through being cleaned up or going offline, and computers that were added to the botnet in this two week period. I’m operating under the assumption that this is a P2P botnet, so any membership queries are being performed by almost every entity in the system. I’m also assuming this data represents the entirety of the botnet.

When we put the original dataset created on 7-31-17 above the layer containing the activity on 8-13-17 we’re presented with an opportunity for temporal as well as spatial analysis.

 

botnet_activity_both_days_7_31_17_top.png
Full Resolution

 

By putting the 7-31-17 dataset on top, we’re presented with a temporal perspective in addition to the geographic perspective. Visible purple dots are not included in the first dataset, or else they would be overlapped by a green dot. These visible purple dots indicate machines that have presumably been added to the botnet. With more datasets it would be possible to track the growth of these networks.

botnet_activity_both_days_8_13_17_top_small.png

Above is a reprojection of the data with the 8-13-17 dataset on the top layer. The temporal perspective shifts when we change the ordering. Visible green dots from the first set may indicate machines that were no longer part of the botnet when the second dataset was created. Machines leaving a botnet is plausible, but it’s also possible that those machines were simply offline or unable to establish a session. It’s entirely possible that, even with a P2P networking scheme, the entire botnet does not ping every system that appears to go offline from every machine on the network; that would seem like a serious security error by the botnet operator. It’s also entirely possible they’re not trying to cover their tracks and are employing a “spray and pray” tactic, running the botnet at full capability and not worrying about the consequences. A full resolution image is linked in the caption.

By looking at both sets under the assumption that the entire botnet revealed itself, we can see whether the botnet is growing or shrinking. If there are more visible purple dots on the map where green dots are layered on top than there are visible green dots on the map where purple dots are layered on top, the botnet is growing. If the opposite is true, the botnet is shrinking.
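The same added/removed reasoning can be done numerically instead of visually by diffing the source addresses in the two exports. A small sketch, assuming each CSV has a source_ip column (the file names and column name are placeholders):

import csv

def source_ips(path, column="source_ip"):
	## collect the unique source addresses from one exported capture
	with open(path) as f:
		return set(row[column] for row in csv.DictReader(f))

older = source_ips("sessions_2017-07-31.csv")
newer = source_ips("sessions_2017-08-13.csv")

print("Possibly added to the botnet: %d" % len(newer - older))
print("Possibly dropped from the botnet: %d" % len(older - newer))
print("Seen in both captures: %d" % len(older & newer))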

botnet_activity_both_days_8_13_17_top.png
Full Resolution

 

The most interesting feature of these comparison maps is the predilection for certain countries and regions. Looking at the rotation of computers, we see the Northeast United States and Florida as hotspots for this activity. The reason is not clear, but it serves as a starting point for additional research. It’s important to remember that data reflects population: major cities all show signs of activity. Major activity concentrations can be defined empirically by normalizing against population. The activity seems to proliferate from areas where activity is already established, so perhaps there is some kind of localized worm activity used for propagation. Let’s take a look at the real elephant in the room: Europe.

 

botnet_activity_both_days_8_13_17_top_europe extent_marked.png

 

The majority of machines seem to be in Europe. Certain regions have concentrated activity; they are marked in red above. From left to right: the UK, the Netherlands, and Hungary. There are also concentrations in Switzerland, Northern Italy, Romania, and Bulgaria.

The main three concentrations pose interesting questions. Why is there so much activity in the UK? The Netherlands concentration can be explained by the number of commercial datacenters and VPS operations; a lot of for-rent services operate out of the Netherlands, making it a regular on IP address queries. Hungary is an interesting and befuddling find. There is no dominating information systems industry in Hungary like there is in the Netherlands. What do all these countries have in common? Why are the concentrations so specific to borders? Answering these questions will be critical in solving the mystery. Next we’ll try our hand at some spatial analysis.

 

botnet_activity_7_31_17_heatmap_std_deviation_small.png

 

A kernel density map, also known as a heatmap, shows the volume of data in geographic space. This is an appropriate spatial analysis to run alongside the point map because it reveals the volume of connections that may be buried under one point. If one point initiates 100 sessions, it’s still represented as one point. These heatmaps reveal spatial perspectives that the point maps cannot.
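The same surface can also be generated outside the GUI with the Spatial Analyst extension. A minimal sketch, assuming Spatial Analyst is licensed and using placeholder paths:

import arcpy
from arcpy.sa import KernelDensity

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = "H:\\botnet"

## "NONE" weights every session point equally; cell size is left to the tool's default
density = KernelDensity("sessions_points.shp", "NONE")
density.save("H:\\botnet\\sessions_density.tif")

arcpy.CheckInExtension("Spatial")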

 

botnet_activity_7_31_17_heatmap_std_deviation_large.png
Full Resolution

 

Immediately we see some interesting volumes that were hidden in the point map. Moscow lights up in this representation, indicating that many connections came from a small geographic area. By using standard deviation to divide the data, the biggest players show up in red. The circular pattern indicates that many connections come from a small area. There is a big representation in Toronto, Canada that wasn’t completely apparent on the other maps. Our focus areas of the UK and the Netherlands are represented. Peripheral areas like Northern France and Western Germany light up on this map, suggesting concentrated activity, perhaps in the large metro areas. Seoul, Korea lights up, suggesting large volumes of connections. There is notable activity in Tokyo. Like I was saying before, Singapore lights up in this map. Singapore is a small city-state at the tip of peninsular Malaysia on the Strait of Malacca; connections here are difficult to distinguish considering the small square mileage of the city. This raises a peculiar question: why is this botnet so particular about boundaries? Singapore is crawling with connections, but neighboring Malaysia, possibly sharing some of the same internet infrastructure, is quiet on the heatmap.

 

botnet_activity_8_13_17_heatmap_std_deviation_small.png

 

As with the other maps, I created a small and a large resolution version. For these kernel density maps, there are several options to represent the data. I chose to use standard deviation and geometric delineations of the data. Each provides a unique perspective, and every additional perspective might reveal something new. The geometric map “smooths” the distribution of data, showing areas that might not have been significant enough to appear in the standard deviation representation.

 

botnet_activity_8_13_17_heatmap_std_deviation_large.png
Full Resolution

 

In the future it might be beneficial to select by country borders and make a choropleth map showing the number of sessions per country. This would reveal countries with multiple sessions coming from the same coordinates.
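That per-country count could come straight from a spatial join between the session points and a country boundary layer; the Join_Count field it produces is what would then be symbolized as the choropleth. A sketch with placeholder paths:

import arcpy

countries = "H:\\botnet\\world_countries.shp"    ## placeholder boundary layer
sessions = "H:\\botnet\\sessions_points.shp"     ## placeholder session points
out_fc = "H:\\botnet\\sessions_per_country.shp"

## JOIN_ONE_TO_ONE adds a Join_Count field holding the number of session
## points that fall inside each country polygon
arcpy.SpatialJoin_analysis(countries, sessions, out_fc, "JOIN_ONE_TO_ONE", "KEEP_ALL", "", "CONTAINS")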

It might also be beneficial to parse the data further and add appropriate symbology and additional maps for the data that was present in both sets as well as the points that were unique to one set. This set of three maps would present the data in an additional spatial context, allowing another perspective for analysis.

As always, I will be on the hunt for additional data. The next step for this project is finding out the condition for reproducing this swarm of connections. If it does turn out to be easily reproducible, the real fun begins. Additional data would be collected at regular intervals and mapped accordingly. With more data comes more realization. Automating the data collection and mapping would be the final step. At some point a geographic perspective would be so apparent, the next steps will become clear.

Until then I’m still on the warpath. Never has research been so personal to me.

Imgur Album

Working with GIS and Python

Python is a powerful scripting language that allows users to script repetitive tasks and automate system behaviors. Python does not force an object-oriented structure on its programs, which differentiates its development and operation from languages like C++ and Java. The scripting syntax of Python might ease the learning curve for those new to programming concepts, and GIS is a great introduction to Python programming. Please excuse any formatting errors; some indentation was lost when copying over.

 

GIS_analysis_esri.jpg

 

ArcGIS has robust support for Python, allowing many GIS methods to be automated for optimized development within the ArcMap software framework. ArcPy is a Python module that lets Python directly interface with ArcGIS software, giving the user powerful scripting capabilities within the ESRI software ecosystem. ArcMap also has a built-in editor that provides a graphical interface users can use to construct scripts without the default Python shell. This feature is called ModelBuilder, and it makes the relationship between Python and the ArcGIS toolkit easier to understand and implement for the visual thinker.

I have provided examples of my own work that were either generated with ModelBuilder or written in the Python IDE. I tried to keep the examples strictly geographic for this post. These scripts aren’t guaranteed to work flawlessly or gracefully. This is a continued learning experience for me and any constructive criticism is welcome.

Here’s an example of what I’ve found possible using python and ArcPy.

 


# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# 50m.py
# Created on: 2017-03-28 15:18:18.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
Idaho_Moscow_students = "Idaho_Moscow_students"
Idaho_Moscow_busstops_shp = "H:\\Temp\\Idaho_Moscow_busstops.shp"
busstops_50m_shp = "H:\\Temp\\busstops_50m.shp"
within_50m_shp = "H:\\Temp\\within_50m.shp"

# Process: Buffer
arcpy.Buffer_analysis(Idaho_Moscow_busstops_shp, busstops_50m_shp, "50 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

# Process: Intersect
arcpy.Intersect_analysis("Idaho_Moscow_students #;H:\\Temp\\busstops_50m.shp #", within_50m_shp, "ALL", "", "INPUT")

 

The output is tidy and properly commented by default, saving the user the time it usually takes to make code tidy and functionally legible. It also includes a proper header and the location of map assets. All of this is done on the fly, making sure quality code is produced every time. This is a great reason to use ModelBuilder over programming manually in the IDE.

The script above takes a dataset containing spatial information about students and bus stops in Moscow, Idaho, applies a 50 meter buffer to the bus stops, and creates a shapefile of all the students that intersect this buffer. This information can then be reused by either of these operations, meaning further operations can be applied to the newly created 50m layer on the fly. We can then increment the distance in ModelBuilder to create shapefiles for different buffers.

The benefit of this over manually creating the shapefile is the obscene amount of time saved. Depending on how thorough the GIS is, each one of these points might need its own shapefile or aggregation of shapefiles. This script runs the necessary hundred or so operations to create the spatial assets in a fraction of the time it would take a human.

The script below takes the same concept but changes the variables so the output is 100m instead of 50m. Segments of the code can be changed to augment the operation without starting from scratch. This makes it possible to automate the creation of these scripts, the ultimate goal.

 

# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# 100m.py
# Created on: 2017-03-28 15:19:04.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
Idaho_Moscow_students = "Idaho_Moscow_students"
Idaho_Moscow_busstops_shp = "H:\\Temp\\Idaho_Moscow_busstops.shp"
busstops_100m_shp = "H:\\Temp\\busstops_100m.shp"
within_100m_shp = "H:\\Temp\\within_100m.shp"

# Process: Buffer
arcpy.Buffer_analysis(Idaho_Moscow_busstops_shp, busstops_100m_shp, "100 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

# Process: Intersect
arcpy.Intersect_analysis("Idaho_Moscow_students #;H:\\Temp\\busstops_100m.shp #", within_100m_shp, "ALL", "", "INPUT")

 

This example, with a 100m buffer instead of a 50m buffer, can be produced either in ArcMap’s ModelBuilder or by manually using the replace function in your favorite text editor. By changing one variable we have another properly formatted script, saving time that would have been spent manually operating the tools in the ArcMap workspace. This can be further developed to take input from the user and run the tools directly through ArcPy, allowing for the possibility of “headless” GIS operations without the need to design manually.
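Taken one step further, the distance itself can be looped over, so a new ModelBuilder export isn't needed for every buffer size. A minimal sketch of that idea, reusing the exercise data from above (the list of distances is arbitrary):

import arcpy

busstops = "H:\\Temp\\Idaho_Moscow_busstops.shp"
students = "Idaho_Moscow_students"

## generate a buffer and an intersect result for several distances in one run
for metres in (50, 100, 200, 400):
	buffer_shp = "H:\\Temp\\busstops_%dm.shp" % metres
	within_shp = "H:\\Temp\\within_%dm.shp" % metres
	arcpy.Buffer_analysis(busstops, buffer_shp, "%d Meters" % metres, "FULL", "ROUND", "ALL", "", "PLANAR")
	arcpy.Intersect_analysis([students, buffer_shp], within_shp, "ALL", "", "INPUT")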

This functionality extends to database operations. In the following script shapefiles are created by attributes in a table.

 


# ---------------------------------------------------------------------------
# airport_buffer.py
# Created on: 2017-03-30 09:06:41.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

 

# Local variables:
airports = "airports"
airports__4_ = airports
Airport_buffer_shp = "H:\\Ex7\\Exercise07\\Challenge\\Airport_buffer.shp"

# Process: Select Layer By Attribute
arcpy.SelectLayerByAttribute_management(airports, "NEW_SELECTION", "\"FEATURE\" = 'Airport'")

# Process: Buffer
arcpy.Buffer_analysis(airports__4_, Airport_buffer_shp, "15000 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

 

This script finds all features labeled "Airport" in a dataset and creates a 15km buffer around each one. By integrating SQL queries, data can easily be parsed and presented. All of this code can be generated using ModelBuilder in the ArcMap client. Efficient scripting comes from applying Python's functional logic efficiently, with a clear realization of the objective to be achieved.

 


import arcpy
arcpy.env.workspace = "H:/Exercise12"

def countstringfields():
	fields = arcpy.ListFields("H:/Exercise12/streets.shp", "", "String")
	namelist = []
	for field in fields:
		namelist.append(field.type)
	print(len(namelist))

countstringfields()

 

This script counts the number of "string" fields in a table. The function countstringfields starts by locating the "String" fields in the attribute table of a shapefile. Next, a list of names is defined. A loop then appends the type of each matching field to the list, and the length of that list, essentially a count made "by hand", is printed for the user, all outside of the ArcMap client. The proper use of indentation and whitespace is an important part of Python syntax, so when things like nested loops are used, special consideration should be taken. This script can be further developed by introducing variables for the shapefile and datatype read from user input, as sketched below; scripts can also be used to update datasets in addition to parsing them, which the example after that demonstrates.
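A parameterized version of that idea might read the shapefile and field type from the user instead of hardcoding them. A sketch (the prompts are mine; everything else mirrors the script above):

import arcpy

def count_fields(shapefile, field_type):
	## ListFields can filter by type directly, so the count is just the list length
	return len(arcpy.ListFields(shapefile, "", field_type))

shapefile = raw_input("Path to the shapefile: ")
field_type = raw_input("Field type to count (e.g. String, Integer): ")
print("%s has %d %s fields" % (shapefile, count_fields(shapefile, field_type), field_type))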

 


import arcpy
from arcpy import env
env.workspace = "H:/Ex7/Exercise07/"
fc = "Results/airports.shp"
cursor = arcpy.da.UpdateCursor(fc, ["TOT_ENP"])
for row in cursor:
	if row[0] < 100000: ## placeholder threshold; the original condition and update were lost when the code was copied over
		row[0] = 0
		cursor.updateRow(row)
del cursor

 

The next script steps away from arcpy: it classifies triangles by their side lengths and checks whether each set of measurements forms a valid triangle.

 

## lists of side measurements to feed the program (the values here are illustrative placeholders)
listA = [3, 4, 5]
listB = [5, 5, 5]
listC = [1, 2, 9]
listD = [6, 6, 4]

count = 0

def triangle_type(a, b, c):
	if a == b == c:
		print "Type: Equilateral"
	elif a == b or b == c or a == c:
		print "Type: Isosceles"
	else:
		print "Type: Scalene"

def triangle_validity(a, b, c):
	if (c > a + b) or (a > b + c) or (b > a + c):
		print "Valid: No"
	else:
		print "Valid: Yes"

##inputs

print("Feeding program lists of measurements...")

while count < 4:
	if count is 0:
		print listA
		a, b, c = listA
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	elif count is 1:
		print listB
		a, b, c = listB
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	elif count is 2:
		print listC
		a, b, c = listC
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	else:
		print listD
		a, b, c = listD
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)

 

The script above was fun to make.

It accepts input in the form of multiple lists. These lists are preprogrammed in this case, but they could take user input or read input from a text file by using Python’s read method. The while loop uses a counter to track how many times it has been run, and the loop is nested with some conditional elements.

 

import csv

yield1999 = []
yield2000 = []
yield2001 = []

f = open('C:\Users\jdean32\Downloads\yield_over_the_years.csv')
csv_f = csv.reader(f)
next(csv_f)
for row in csv_f:
	yield1999.append(row[0])
	yield2000.append(row[1])
	yield2001.append(row[2])

yield1999 = map(float, yield1999)
yield2000 = map(float, yield2000)
yield2001 = map(float, yield2001)

f.close()

print("1999: %s") %(yield1999)
print("2000: %s") %(yield2000)
print("2001: %s") %(yield2001)

year1999 = 1999
max_value_1999 = max(yield1999)
min_value_1999 = min(yield1999)
avg_value_1999 = sum(yield1999)/len(yield1999)
print("\nIn %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield.") %(year1999, max_value_1999, min_value_1999, avg_value_1999)

year2000 = 2000
max_value_2000 = max(yield2000)
min_value_2000 = min(yield2000)
avg_value_2000 = sum(yield2000)/len(yield2000)
print("In %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield.") %(year2000, max_value_2000, min_value_2000, avg_value_2000)

year2001 = 2001
max_value_2001 = max(yield2001)
min_value_2001 = min(yield2001)
avg_value_2001 = sum(yield2001)/5
print("In %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield.") %(year2001, max_value_2001, min_value_2001, avg_value_2001)

 

Like I said before, this was fun to make. Always eager to take the road less traveled, I thought of the most obtuse way to make this calculation. The objective of the above script was to read text from a file and compare three years of agriculture data. The script then finds which year had the minimum yield, the maximum yield, and the average yield. This is all accomplished with a quick for loop, relying on several sets of variables to make sure the final answers are correct. This program can ingest different kinds of input, so changing the text file, or the location where it is looked for, will produce different results from the same script. Different data can automatically be run through this particular operation.

 

yield1999 = [3.34, 21.8, 1.34, 3.75, 4.81]
yield2000 = [4.07, 4.51, 3.9, 3.63, 3.15]
yield2001 = [4.21, 4.29, 4.64, 4.27, 3.55]

location1 = (yield1999[0] + yield2000[0] + yield2001[0])/3
location2 = (yield1999[1] + yield2000[1] + yield2001[1])/3
location3 = (yield1999[2] + yield2000[2] + yield2001[2])/3
location4 = (yield1999[3] + yield2000[3] + yield2001[3])/3
location5 = (yield1999[4] + yield2000[4] + yield2001[4])/3

locations = [location1, location2, location3, location4, location5]
count = 1

text_fileA = open("C:\Temp\OutputA.txt", "w")

for i in locations:
	text_fileA.write(("The average yield at location %s between 1999 and 2001 %.2f\n") %(count, i))
	count = count + 1

text_fileA.close()

max1999 = (yield1999.index(max(yield1999))+1)
max2000 = (yield2000.index(max(yield2000))+1)
max2001 = (yield2001.index(max(yield2001))+1)
min1999 = (yield1999.index(min(yield1999))+1)
min2000 = (yield2000.index(min(yield2000))+1)
min2001 = (yield2001.index(min(yield2001))+1)

minmax1999 = [1999, max1999, min1999]
minmax2000 = [2000, max2000, min2000]
minmax2001 = [2001, max2001, min2001]

minmax = [minmax1999, minmax2000, minmax2001]

text_fileB = open("C:\Temp\OutputB.txt", "w")

for i in minmax:
	text_fileB.write(("In %s we yielded the least at location %s and the most at location %s.\n") %(i[0], i[2], i[1]))

text_fileB.close()

 

Another attempt at the agriculture problem. Versioning is something I find useful not only for keeping a record of changes but also for keeping track of progress. This was the fourth version of this script and I think it turned out very unorthodox, which is something I find most interesting about coding: you can find multiple approaches to complete an objective. The two scripts above are similar but approached in different ways. This script uses a for loop to run through a contextually sensitive number of inputs. The values were hardcoded into the program as variables at the start of the script, but they could be read from a file if necessary.

The following script looks for the basin layer in an ArcMap file and clips the soils layer using the basin layer. This produces an area where both the soil layer and the basin layer are present. From this clipped soil layer, the script goes on to select the features whose attribute is "Not Prime Farmland". This is useful for property development where the amount of farmland available is a consideration.

 

 

import arcpy

print "Starting"

soils = "H:\\Final_task1\\soils.shp"
basin = "H:\\Final_task1\\basin.shp"
basin_Clip = "C:\\Users\\jdean32\\Documents\\ArcGIS\\Default.gdb\\basin_Clip"
task1_result_shp = "H:\\task1_result.shp"

arcpy.Clip_analysis(soils, basin, basin_Clip, "")

arcpy.Select_analysis(basin_Clip, task1_result_shp, "FARMLNDCL = 'Not prime farmland'")

print "Completed"

 

The next script clips all feature classes from a folder called "USA" according to the Iowa state boundary. It then places them in a new folder. This is useful if you have country-wide data but only want to present the data from a particular area, in this case Iowa.

The script will automatically read all shapefiles in the USA folder, no matter the amount.

 

 

import arcpy

sourceUSA = "H:\\Final_task2\\USA"
sourceIowa = "H:\\Final_task2\\Iowa"
iowaBoundary = "H:\\Final_task2\\Iowa\\IowaBoundary.shp"

arcpy.env.workspace = sourceUSA
fcList = arcpy.ListFeatureClasses()

print "Starting"

for features in fcList:
	outputfc = sourceIowa + "\\Iowa" + features
	arcpy.Clip_analysis(features, iowaBoundary, outputfc)

print "Completed"

 

The following script finds the average population for a set of counties in a dataset. By dividing the total population by the number of counties, the average population is found. This is useful for calculating values in large datasets without doing it by hand.

 

 

import arcpy

featureClass = "H:\\Final_task3\\Counties.shp"

row1 = arcpy.SearchCursor(featureClass)
row2 = row1.next()

avg = 0
totalPop = 0
totalRecords = 0

while row2:
	totalPop += row2.POP1990
	totalRecords += 1
	row2 = row1.next()

avg = totalPop / totalRecords
print "The average population of the " + str(totalRecords) + " counties is: " + str(avg)

 

The following is a modified script that calculates the driving distance between two locations. Originally the script calculated the distance between UNCC and uptown; it has been edited to work from user input. The API is finicky, so the variables have to be exact to call the right data from the API. There is some reconciliation of user input in the form of replacing spaces with plus signs.

 

## Script Title: Printing data from a URL (webpage)
## Author(s): CoDo
## Date: December 2, 2015

# Import the urllib2 and json libraries
import urllib2
import json
import re

originaddress = raw_input("What is the address?\n")
originstate = raw_input("What is the state?\n")
originzip = raw_input("What is the zipcode\n")
destinationaddress = raw_input("What is the destination address?\n")
destinationstate = raw_input("What is the state?\n")
destinationzip = raw_input("What is the destination zipcode\n")

print originaddress
print originstate
print originzip
print destinationaddress
print destinationstate
print destinationzip

originaddress = originaddress.replace (" ", "+")
destinationaddress = destinationaddress.replace (" ", "+")

# Google API key (get it at https://code.google.com/apis/console)
google_APIkey = "" ##removed for security

# Read the response url of our request to get directions from UNCC to the Time Warner Cable Arena
url_address = 'https://maps.googleapis.com/maps/api/directions/json?origin=%s,%s+%s&destination=%s,%s+%s&key='% (originaddress, originstate, originzip, destinationaddress, destinationstate, destinationzip) + google_APIkey
##url_address = 'https://maps.googleapis.com/maps/api/directions/json?origin=1096+Meadowbrook+Ln+SW,NC+28027&destination=9201+University+City+Blvd,NC+28223
url_sourceCode = urllib2.urlopen(url_address).read()

# Convert the url's source code from a string to a json format (i.e. dictionary type)
directions_info = json.loads(url_sourceCode)

# Extract information from the dictionary holding the information about the directions
origin_name = directions_info['routes'][0]['legs'][0]['start_address']
origin_latlng = directions_info['routes'][0]['legs'][0]['start_location'].values()
destination_name = directions_info['routes'][0]['legs'][0]['end_address']
destination_latlng = directions_info['routes'][0]['legs'][0]['end_location'].values()
distance = directions_info['routes'][0]['legs'][0]['distance']['text']
traveltime = directions_info['routes'][0]['legs'][0]['duration']['value'] / 60

# Print a phrase that summarizes the trip
print "Origin: %s %s \nDestination: %s %s \nEstimated travel time: %s minutes" % (origin_name, origin_latlng, destination_name, destination_latlng, traveltime)

 

This next script looks for feature classes in a workspace and prints the name of each feature class and the geometry type. This would be useful for parsing datasets and looking for specific features, like polygons.

 

import arcpy
from arcpy import env
env.workspace = "H:/arcpy_ex6/Exercise06"
fclist = arcpy.ListFeatureClasses()
for fc in fclist:
	fcdescribe = arcpy.Describe(fc)
	print (fcdescribe.basename + " is a " + str.lower(str(fcdescribe.shapeType)) + " feature class")

 

The following script adds a text field to the attribute table for roads. The new field is called Ferry and is populated with either "Yes" or "No" values, depending on the value of the FEATURE field.

This is useful for quickly altering data in an attribute field or dataset without directly interfacing with the ArcMap client.

 

##libraries
import arcpy
from arcpy import env
env.workspace = "C:/Users/jdean32/Downloads/Ex7/Exercise07"

##variables
fclass = "roads.shp"
nfield = "Ferry"
ftype = "TEXT"
fname = arcpy.ValidateFieldName(nfield)
flist = arcpy.ListFields(fclass)

if fname not in flist:
	arcpy.AddField_management(fclass, fname, ftype, "", "", 12)
	print "Ferry attribute added."

cursor = arcpy.da.UpdateCursor(fclass, ["FEATURE","FERRY"])

for row in cursor:
	if row[0] == "Ferry Crossing":
		row[1] = "Yes"
	else:
		row[1] = "No"
	cursor.updateRow(row)
del cursor

 

The following script uses some of the same functionality as the airport script near the beginning of this article. It first creates a 15,000 meter buffer around airport features in a shapefile. In addition to the buffer around airports, the script creates a 7,500 meter buffer around airports that operate seaplanes, which requires looking at the attribute table for seaplane bases. The end result is two separate buffers. A picture says a thousand words; by having two buffers we are multiplying the amount of data that can be projected by a cartographic visualization.

 

# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# airport_buffer.py
# Created on: 2017-03-30 09:06:41.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
airports = "airports"
airports__4_ = airports
Airport_buffer_shp = "H:\\Ex7\\Exercise07\\Challenge\\Seaplane_base_buffer.shp"

# Process: Select Layer By Attribute
arcpy.SelectLayerByAttribute_management(airports, "NEW_SELECTION", "\"FEATURE\" = 'Seaplane Base'")

# Process: Buffer
arcpy.Buffer_analysis(airports__4_, Airport_buffer_shp, "7500 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

 

Finally, we have a script that looks through a workspace, reads the feature classes, and then copies the polygon features to a new geodatabase. Once again, this makes it easy to parse and migrate data between sets.

 

import arcpy, os
arcpy.env.workspace = r'H:\arcpy_ex6\Exercise06'
fclass = arcpy.ListFeatureClasses()

outputA = r'H:\arcpy_ex6\Exercise06\testA.gdb'
outputB = r'H:\arcpy_ex6\Exercise06\testB.gdb'

for fc in fclass:
	fcdesc = arcpy.Describe(fc).shapeType
	outputC = os.path.join(outputA, fc)
	arcpy.CopyFeatures_management(fc, outputC)
	if fcdesc == 'Polygon':
		outputC = os.path.join(outputB, fc)
		arcpy.CopyFeatures_management(fc, outputC)

 

Python is a blessing for geographers who want to automate their work. Its logical but strict syntax allows easily legible code. Its integration with the ArcGIS suite and its fairly simple syntax make it easy to pick up, for experts and beginners alike. ModelBuilder abstracts the programming process and makes it easier for people who are familiar with GUI interfaces. There is little a geographer with a strong knowledge of Python and mathematics can't do.

Working with Untangle Firewall

Untangle Firewall is a hardware security solution that provides a robust platform to control and observe network operations. The suite of software includes a firewall, web content blocker, routing capabilities, and many more traffic shaping features. I was interested in trying this out because I was looking for peace of mind regarding home network security. I’m pleased with how my Untangle box has been working so far. In this write-up I briefly explain my experience with different apps included in the software.

The hardware requirements for Untangle version 13 are pretty light for a small home network. To avoid any hassle I tried out a Protectli Vault, fitted with a J1900 processor, 8 GB of RAM, a 120 GB SSD, and a 4-port Intel NIC for $350 at the time of this writing. It’s a workhorse and perfect for my network of about 8–12 devices. It’s working with a 300/20 connection with constantly redlined upload traffic, and the CPU has clocked in at 50% under the heaviest load, so there is definitely room to scale with this route. If I wanted to get brave I could switch out the 8GB memory stick for 16GB, if the board allows it, and the SSD swapfile should carry me plenty if things get rough.

Installation can be done using just a USB keyboard. In this case Untangle was loaded from a USB stick into one of the two USB connections on the Vault. Untangle charges different rates for commercial and home users. Out of the gate, Untangle comes with a 14-day free trial. After the grace period it’s $50/year for the home version, which includes all the “apps”. One thing I wish it had, though, is a screenshot feature.

 

collage-2017-08-08.png

 

Out of the box: simple and productive. The homepage can be customized to include a plethora of different visualized reports.

Network management took a second to get used to. At first I wanted to get my bearings by googling every session I saw pop up, then slowly expanding the network to more devices as I felt more comfortable. This led me to some interesting whois websites which provide useful domain data to compare with the built-in Untangle resolution. I noted the IPs I didn’t know, using the session viewer in real time, until I had become familiar with the addresses and ranges that services on the network typically use. This type of experience with network behavior lets an administrator quickly view the status of the network by looking at the geographic or other visual representations of data. I feel the at-a-glance data visualization is a key advantage of using Untangle and software like it. I chose to investigate the different apps individually so understanding their functions became easier. At first the amount of information available was overwhelming, but the software has a reasonable learning curve so that feeling was short lived.

I apologize for the screen pictures. For this particular instance I wanted to know what the OCSP connection was; Google suggested it checks the validity of the certificates installed on the machine. I like the at-a-glance functionality a home screen with contextually selected apps offers. The map tickles my geographic fancy: sometimes it’s easier to work with spatial data, and glancing at a map and noting the locations of the connections can assist with interpretation on the fly. It would be even better if you could export the state of the dashboard to a static image. Exporting the configuration of the dashboard would be beneficial too, allowing an administrator to quickly restore the last configuration. I might be missing something, but it doesn’t seem to allow moving visualization tiles once they’ve been placed on the dashboard. This could be a major inconvenience when reorganizing or grouping visualizations after the fact. The geographer in me approves of the map view all the same.

At first it’s easy to misestimate the number of connections a computer can make in a browsing session. The web page loads, the 10 or so ads and marketing services connect, the DNS is queried. With three internet devices browsing the web and interfacing with media, the number of sessions can easily reach the hundreds. I worked with each app individually until I felt like I had a solid understanding of the underlying systems. Approaching the software in this manner made it easier to understand at a functional level.

 

800px-1600x1080_apps.png

 

First up was the firewall. Through trial and error, I figured out which connected sessions were important to my computing. This was the most critical component I needed security-wise. Being able to see all of the open sessions, in real time and retroactively, gave me enough data to play with initially to get a hang for the system and understand the routine sessions on my network. The firewall lets you set rules that block traffic. Let’s say I own a business and I want to block all traffic that appears to be from Facebook; this would be possible by setting custom firewall rules that block the Facebook domain. In my case I wanted to identify what exactly was going on with the background connections, Windows telemetry data, time synchronization efforts, and web sessions being kept alive by a browser. I identified the major, constant connections, like the cloud migration to Amazon Cloud Drive I’m currently running. This allows the administrator to get comfortable with the network and see how it is normally shaped. Along with these connections was a constant odrive connection that was brokering the Amazon Cloud Drive upload. Connections like these that I have accounted for personally were set to bypass the firewall entirely so I could reconfigure the rules without worrying about them being taken offline. The peace of mind this device provides when auditing or performing network forensics feels priceless.

Untangle includes two web traffic shaping apps: Web Filter and Web Monitor. A few of the apps have “lite” (free) and full (paid) versions. The application library has a Virus Block Lite and a Virus Blocker; one is the free version and the other is included in the subscription. According to the Untangle developers, the lite version and the paid version provide additional protection when run in tandem; they might be using different databases or heuristics to identify threats between the two apps.

Web Monitor is the free app; it allows you to monitor web traffic, its origination, destination, size, associated application, etc. Web Filter is required to shape the traffic. Out of the box, Web Filter comes with several categories of web traffic it blocks: pornography, malware distributors, known botnets, and anonymizing software are all blocked by default. Several hundred additional categories of web traffic exist to make this selection as precise as an administrator would like. There was one instance where the filter warned me before I was redirected to a malware site while sifting through freeware; that alone makes it a necessity for me. The ad blocker, which functions similarly to a Pi-hole, catches the ads before they even make it to the client computer. Normally a user would expect the browser to block ads, but that’s not the case with this in-line device. The ability to catch ads over the wire adds a line of defense beyond a traditional browser ad blocker.

Intrusion prevention is another app I couldn't live without. Intrusion prevention systems (IPS) use behavioral and signature analysis to inspect packets as they move across the network. If the signature of a communication or a behavior registers as malicious, the IPS logs it and, according to the user-set rules, blocks the attempted misbehavior. The intrusion detection was quiet while I was messing with it, which is a good sign. There were several UDP portscans and distributed portscans originating from the Untangle box. These might be functions of the Untangle install or the intrusion detection app scanning the public IP for vulnerabilities, but I'm not 100% sure; it could always be a malicious actor over the wire. Whatever the cause, these portscans were the only behaviors the intrusion prevention system picked up.
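For what it's worth, here's a toy sketch of the signature half of that idea in Python. The signatures are made up for illustration; real IPS engines rely on large, curated rule sets and stateful behavioral checks, so this only shows the core pattern-matching step.

## Toy signature matching -- illustration only, not a real IPS.
SIGNATURES = {
    b"\x90\x90\x90\x90": "NOP sled (made-up signature)",
    b"' OR '1'='1": "SQL injection probe (made-up signature)",
}

def inspect(payload):
    ## Return the names of any known-bad signatures found in the payload.
    return [name for sig, name in SIGNATURES.items() if sig in payload]

for alert in inspect(b"GET /login?user=admin' OR '1'='1 HTTP/1.1"):
    print("ALERT:", alert)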

The question becomes: how thorough do you want to be when setting up rules for the apps? Let's say a Chromecast is portscanning itself for benevolent reasons, like troubleshooting a connection. Should you allow this? Should you follow the rule of least privilege? Should a Chromecast have the ability to recon your network? Security and convenience tend to be mutually exclusive to a certain degree. Knowing where your sweet spot of productivity lies will allow better administration of the box.

 

collage-2017-08-09.png

Bandwidth control is something I'm still getting the hang of. One question I have is why the speeds reported by the bandwidth monitoring app and the interface readings seem to be off by a factor of 10. Both appear to present results in MB/s, and I haven't found a unit conversion error.
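One thing I still want to rule out (my own guess, not something I've confirmed in either app): a megabit/megabyte mix-up produces a factor of 8, which at a glance looks a lot like 10.

## Hypothetical numbers, just to show the arithmetic.
interface_reading = 80.0             ## if this were actually Mbit/s...
app_reading = interface_reading / 8  ## ...the same throughput in MB/s
print(app_reading)                   ## 10.0 -- roughly an order-of-magnitude gap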

I can't speak for the bandwidth app itself, but there are additional apps for traffic shaping. WAN Balancer makes sure a serving load is spread across a number of connections. If you were running a server that needs high availability and maximized performance, you would get some use out of this feature. WAN Failover activates a backup connection in case the primary WAN becomes unreachable. Again, these features are geared toward users who need traffic shaping and high-availability solutions.

There are apps for both IPsec VPN and OpenVPN. I didn't have a chance to mess around with these. There is a webinar on the IPsec VPN app hosted by Untangle on YouTube. I'm curious about the particulars because I'm eager to get this feature operational as soon as possible.

I had an interesting time with the SSL inspector. This app lets you decrypt HTTPS sessions and intercept traffic before encrypting it again and sending it on its way. Turning it on threw SSL errors on almost all devices in the house; things like the Roku couldn't connect to YouTube because the certificate chain was incomplete, since the Untangle box was middle-manning the connection. Luckily, it comes with a certificate creator that can serve certificates to client computers so browsers won't think it's a malicious redirect.
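One quick way to confirm the inspector really is in the middle (my own check, not an Untangle feature) is to pull the certificate a client actually receives and look at the issuer; with SSL inspector active it should be the locally generated root CA rather than the site's public CA.

## Print the certificate presented to this client.
## Feed the output to `openssl x509 -noout -issuer` to read the issuer line.
import ssl

pem = ssl.get_server_certificate(("example.com", 443))  ## hypothetical host
print(pem)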

Transferring the root certificate around was comically difficult. It couldn't be transferred over Gmail because of security restrictions. That might have been because Google thought the attachment was malicious, or because it's not good OpSec to email root CA installers around, even though this one was destined for my own client computer. The SSL app can generate an installer for Windows machines in addition to the plain cert.

I was able to move it around by putting it on Google Drive. Downloading it with Edge threw all sorts of bells and whistles. First SmartScreen said it didn't recognize the file and threw the "are you sure you want to download?" prompt. Then came the browser warning that "this file could harm your computer." Then Kaspersky prompted about the file. Finally, UAC was triggered. This is all in good measure; installing bogus certs on computers this way can be compromising.

SSL inspector needed to be turned off while this configuration was being done; the internet was unusable in browsers like Edge with SmartScreen because of the certificate errors. MAC addresses for devices with hardcoded certs were set to bypass the SSL inspector altogether so they wouldn't throw errors.

 

stuntsec_ca.png

 


Captive Portal and Brand Manager were nice touches to include, and they were probably the most fun I had playing around with anything in the suite. Brand Manager lets you supply your own logos to replace the default Untangle logo throughout the software. I designed a mockup logo for fun and really enjoyed how thorough this functionality was.

The captive portal seems to function in a similar way to the SSL inspector, though I think it uses a different certificate, because it throws certificate errors even on machines with the SSL inspector cert installed. The captive portal page can include your Brand Manager content, display and solicit agreement to a terms of service, offer the option to download the certificate and/or the installer, log a user in, and broker a number of other useful functions. Very cool if you're trying to administer web usage.

 

Stuntman Security 2.png

 

Web Cache is something you want to consider if you've got the resources for it. A web cache monitors traffic and puts frequently visited elements in a cache that it can serve locally. If I'm logging on to Facebook every day, it's easier, and arguably safer, to store the Facebook logo locally and serve the local copy instead of asking the website for it each time. The web cache presents a lucrative target for attackers, but luckily, keeping tabs on its operation with the Untangle reporting system is easy.
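The idea is simple enough to sketch (concept only, not Untangle's implementation): keep a local copy of frequently requested objects and serve that instead of re-fetching the same bytes from the origin every time.

## Toy in-memory web cache -- concept only.
from urllib.request import urlopen

_cache = {}

def fetch(url):
    if url not in _cache:  ## miss: fetch from the origin once
        with urlopen(url) as resp:
            _cache[url] = resp.read()
    return _cache[url]  ## hit: serve the stored copy

logo = fetch("https://example.com/logo.png")        ## hypothetical URL
logo_again = fetch("https://example.com/logo.png")  ## second call served locally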

There are also the features you would expect to see in home security software; Untangle's advantage is catching threats over the wire, theoretically before they hit the client box. The complete package includes the two virus scanning apps and Phish Blocker, which I assume uses some kind of DNS or URL checking to catch fraudulent sites. There are also the two spam blocker apps, which I believe work with a cloud threat database. These tools provide the same functionality as a security suite on your desktop, and if you start seeing unusual malware activity, you can leverage the firewall against it to really turn up the heat.

In addition to the virus and malware protection, an ad blocker is included. As with the advantage above, Untangle sees the advertising domains and blocks them before they reach the boxes behind it. I know for certain the ad blocker has been busy on my network.

Active Directory support is available to further expand your capability on the local network; I didn't have a chance to mess around with it. Most home networks don't have directory services running, but some power users out there should get a kick out of it. I played around with Policy Manager for a bit. It's useful if you want to run SSL inspection on one group of devices and ignore others, like streaming devices. Essentially, each policy runs its own set of apps and generates its own reports. Very useful for compartmentalizing your network.

A lot of the Untangle apps demand more resources as you connect more devices to the network, so you need to be conscious of the box running Untangle and how scalable it is. If you're running a web cache for 100 users, the resources required to manage it grow far beyond what 10 users need, depending on their workflows. SSL inspector can be a problem if resources are limited while the workload increases, and intrusion detection is another relative resource hog.

I learned about DHCP and routing the hard way, which is always the most effective way. I realized I wasn't resolving hostnames for devices that were connected to the wireless router. A router, typically by default, performs NAT and sends all upstream traffic from a single IP address. This serves two purposes: first, there aren't enough IPv4 addresses to issue one to every device, and second, it's safer to have the router acting as a firewall so each home device doesn't directly face the internet.

By switching the wireless router behind the Untangle box to "access point" mode, DHCP serving was deferred to the Untangle box. Untangle was then able to resolve the hostname of each device connected to the Wi-Fi, which allows for finer tuning of access rules and traffic shaping.

The remote functionality is robust and well supported. Access can be tailored to the user: users that only need access to reports can be safely granted it without enabling access to system settings. Multiple boxes can be administered from a single interface, and phone administration is possible through the browser. HTTP administration must be enabled on the box before it can be configured from a client machine.

The reports app, though more of a service, is probably the most important app on the box. Reports act as the liaison between the administrator and the Untangle utilities. Graphs are easily generated and the data is visualized so it can be digested on the fly. Reports can be stored on the box for up to 365 days, though you will have to account for the resource usage of maintaining that database. Reports can also be sent to your email inbox automatically at an interval of your choosing; the emailed report contains much of the top-level information about the box's performance, allowing remote administration to be conducted confidently and quickly.

The configuration of each Untangle install can be backed up with the Configuration Backup app. It has built-in Google Drive functionality and can back up to and restore from the cloud, eliminating the need for panic if a box becomes physically compromised. Another scenario for this functionality would be sending a configuration template to new boxes: after installing a new box, you just select the loadout from Google Drive, and hours of manual configuration can be avoided. The same backup functionality is available for reports. So essentially, if a box burns up, you just replace the hardware and you're back off to the races thanks to the automated backups.

I had a great time messing around with this software, and I'm very pleased with the hardware purchase. The all-in-one computer plus a year's subscription to Untangle at Home was $400. I'm enjoying it so much that I'm considering a second box I can administer remotely. The box definitely provides a peace of mind that application-level solutions couldn't. Hopefully, in the future, I can use some of the data for geographic projects; I've already started messing around with projecting some of the geographic data in ArcMap. Here's to hoping for more positive experiences working with the Untangle box.