Mapping Computer Networks

A network map represents the relationship between objects. This representation can be 2-dimensional or 3-dimensional depending on how the data is structured. Network maps are useful for mapping social relationships, supply chains, and, as I’ll demonstrate in this post, computer networks.

Creating maps of cyberspace is inherently unintuitive. The instantaneous and global nature of networks like the internet defies traditional spatial interpretation. By depicting these networks, for example, on a 2-dimensional plane, the relationships between devices in a network become easier to interpret at a glance.

Below is a network topology map I created to illustrate the relationship of devices I personally manage. For the creation of this map I used the free tools from lucidchart.com. The free component of the tool is limited to 60 elements, including line features.

network map
Full Resolution

The network consists of 8 servers, 2 desktops, 2 laptops, 2 firewalls, and 8 media devices across 2 sites. By using a combination of symbology and labels, each computer and its function can be quickly interpreted.

I’d like to take a moment to stress what I mean when I say “at a glance” or “on the fly” when referring to data visualizations. Data, in its rawest form, can be difficult to interpret quickly. Visualizations aid analysis by making data easier to communicate in presentation and easier for an analyst to work with in terms of speed and reliability. When I refer to elements of data visualizations, like maps, that help convey data at a glance, I’m addressing the things that make the data more communicable: conceptual and spatial accessibility, speed of interpretation, and reliability in the sense of easy, unambiguous identification.

Stylistically, the above network map is radial in nature, with the internet occupying the space near the center. In networks that use intranets, or private networks, this space might instead be given to the main routers, switches, domains, or any other device that sees the most traffic or performs a key role in the network. This network is split into three parts, all communicating with the other devices through the internet. For this reason the internet becomes the central feature of the map, the backbone of the network. It’s emphasized by its position on the map, and since this central position tends to draw the eye, it’s easier to, you guessed it, interpret at a glance.

There are 3 sections in the general network structure. We’ll call the line going to the top of the diagram from the internet symbol site A, and the one drawn towards the bottom, site B. The three separate lines drawn from the internet symbol towards the left represent assets that are in the “cloud”, hardware I don’t have physical access to. These machines aren’t on the same network, represented by the separate, non-intersecting lines, but they’re grouped according to the remote nature of their access.

I tried to make the symbology as intuitive as possible, labeling the different devices by their role, technical specifications, and operational capacity. For example, the brick wall represents a firewall unit. At the top we see the all-in-one Untangle unit I wrote about in this article (Working with Untangle Firewall). Site A utilizes a two-network setup: all the server assets sit behind the firewall and all the personal devices operate off their own router. This is a network security concept called compartmentalization. If a personal device were ever compromised, it could otherwise be leveraged against the rest of the network; keeping the server farm behind its own firewall gives it an extra layer of operational security. This also allows the personal devices to bypass firewall rules which might interrupt leisurely “workflows”, while at the same time simplifying firewall operation by not requiring additional rules and conditions.

Site B utilizes a different strategy: the Untangle box featured in this article (Building an Untangle Box) routes and shapes all traffic. However, the traffic is compartmentalized internally by two separate wifi networks and a hardwired network. The server built in this article (Building a 50TB Workstation/Server) operates off of this box via ethernet. Everything handling sensitive operations like SSH work or banking operates on one wifi network with rules tailored specifically for this heightened level of security. Home media and leisure devices use the other wifi. The idea is that if a router ever becomes compromised, it won’t have leverage over all the devices on the network. This is in addition to the routers being in access point mode, sending all traffic to the Untangle box for rules and routing. It never hurts to have these failsafes. All traffic at site B sits behind a firewall, as opposed to site A, which sits behind a modem and router combo unit. This is inherently safer considering all traffic must pass through the Untangle box as it moves to or from the internet or, theoretically, other devices.

In the cloud there are 3 VPS servers. These host a variety of functions, with the core functionality listed beside them on the map. As mentioned earlier, these servers aren’t on the same network, or even in the same country for that matter. This relationship is conveyed by the individual lines that do not intersect on their way to the internet symbol.

Creating a network map consists of a few design elements, with plenty left up to the author. It’s easy to begin with a radial design in mind, with devices that serve as central points in the network at the center. Grouping devices by role or location helps the reader spatially interpret assets on the fly. Using easily understandable symbology and verbose labeling helps clarify finer details. Like all maps, computer network maps can change, and having a program that allows you to update and edit features is useful for making changes.
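For readers who would rather script a diagram than drag elements around, here is a minimal sketch of the same radial idea using the networkx and matplotlib libraries (assuming both are installed); the device names are placeholders rather than the actual assets on the map above.

import networkx as nx
import matplotlib.pyplot as plt

# Placeholder devices grouped roughly the way the map above is organized.
edges = [
    ("Internet", "Site A Firewall"), ("Site A Firewall", "Server Farm"),
    ("Internet", "Site A Router"),   ("Site A Router", "Personal Devices"),
    ("Internet", "Site B Untangle"), ("Site B Untangle", "50TB Server"),
    ("Site B Untangle", "Secure WiFi"), ("Site B Untangle", "Media WiFi"),
    ("Internet", "VPS 1"), ("Internet", "VPS 2"), ("Internet", "VPS 3"),
]

G = nx.Graph(edges)
pos = nx.spring_layout(G, seed=42)  # pulls the highly connected internet node toward the center
nx.draw(G, pos, with_labels=True, node_size=1200, font_size=8)
plt.savefig("network_map.png", dpi=200)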

The future of maps will involve an abundance of cyberspace assets. Being able to map these networks will be a key component in the toolkit of future cartographers.

Working with Vantrue X2 Dashcam and Dashcam Viewer

Dashcams are becoming more and more affordable as they become easier to manufacture and their use becomes more ubiquitous. I had previously used a Rexing R2 dashcam but was looking to get into something with more robust data collection capability. The Rexing R2 served as a good initial exposure to dashcam operation and the associated workflow (storage, editing). I was able to incorporate dashcam operation into my working theory of data curation: any dashcam data that is collected, even if it is not inherently valuable, may prove valuable in the future, and thus should be stored indefinitely.

A79U_1311590507216150342yaEiXxskm

As a quick example of the usefulness of this kind of dashcam ubiquity, we can look to the meteor that came down near Chelyabinsk, Russia in February 2013. Almost all of the footage is from dashcams, which are widespread in the country as protection against insurance fraud, and CCTVs. I’m not saying I’m likely to catch a meteor coming down to Earth and that it’s my responsibility as a dashcam owner to be prepared for that moment, but I’d rather be caught with the camera on rather than off. The footage can also become the medium for other creative expression.

I enjoy working with the footage, speeding it up and putting it alongside music. Driving is something I enjoy, and editing driving footage provides a similar satisfaction. Unfortunately, the Rexing R2 and its fish-eyed convex lens were destined to end badly. The lens protruded beyond the safety of the bezel, and all it took was one instance of accidentally setting it lens-side down on an abrasive surface for the lens to be slightly cracked, enough to blemish the picture.

Finding a camera which was immune to this kind of operator error was my first priority. Also important was the incorporation of a GPS unit with exportable data. I found a reasonable solution in the Vantrue X2. It was a steal on Amazon for $99, though it seems to be out of stock now. It comes out of the box with 2K filming capability, well-tuned night vision, 64GB microSD support, and an optional GPS mount. This cam checked all the boxes. Two days later I had it installed and took it for a test drive.

A couple of things to consider right off the bat: I do quite a bit of driving on average, and I’m not one who wants to dismount the camera and export all the footage several times a week. I also thought it would be irresponsible, since I’m storing this footage for whatever future opportunities might arise, to film in less than the camera’s full 2K resolution. 64GB of SD card capacity becomes just “OK” at this point, storing between 6 and 7 hours of video before needing to be hooked up to the computer and offloaded. 128GB or greater might be something I look for in the future, although I’m definitely not in the market for another camera. The 6 hours hasn’t been a problem except for a handful of times I’ve been driving long distances and found myself needing to offload the footage temporarily on a device before delivering it to the storage server. However, the average user will not have these problems if they’re not meticulously hoarding this data. The camera has functionality that allows it to overwrite previous footage when the card becomes full. Relying on this rolling recording will always ensure you have the last 6 hours of driving footage, no maintenance required.
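As a rough sanity check on that 6-to-7-hour figure, the small sketch below estimates recording time from card size and bitrate; the ~20 Mbps figure for 2K footage is an assumption on my part, not a published spec for the X2.

# Rough estimate of recording hours per card, assuming ~20 Mbps for 2K footage.
def recording_hours(card_gb, bitrate_mbps=20):
    usable_bits = card_gb * 1e9 * 8               # card capacity in bits (decimal GB)
    return usable_bits / (bitrate_mbps * 1e6) / 3600

for size_gb in (64, 128, 256):
    print("%d GB card: roughly %.1f hours" % (size_gb, recording_hours(size_gb)))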

Armed with the camera and GPS mount, I was ready to collect the data, which came naturally over the following months. The next step in this geographic exploration was to incorporate this data into some sort of map. This led me to Dashcam Viewer by Earthshine Software. This program extracts the GPS data from the videos, plots it on a map, and allows you to cartographically examine your driving. Dashcam Viewer is available for Windows and Mac. Sadly, there is no official Linux version, although I haven’t tried emulating it on a Linux machine with Wine.

 

dcv_2-1-0_win

 

The first video I thought to make was a realtime video with two maps of different scale, showing where the vehicle is in relation to surroundings that might not be visible on film. Dashcam Viewer includes lines that show differences in relative speed, which is a nice touch and saves time compared to crunching this data manually in something like ArcGIS.

Capturing the map footage required a little ingenuity. I couldn’t save a video of the Dashcam Viewer coordinate route, so I thought capturing video of the desktop and then cropping it to the window in question would be the easiest route to a result. The finer details could be ironed out afterwards. I was able to create the two cropped videos of the maps and, using the Filmora editor, combine them with the actual footage. A little editing flair and some music was all it took to put together this rough draft, which served as a proof of concept for future projects.

 

 

Next I wanted to move on to timelapse videos so these new map perspectives could be incorporated. The length of the editing process is something I’m still trying to reduce with this workflow. Capturing the 2 maps in realtime using XSplit to record the desktop adds 2 times the length of the original footage to the process. For the next project, I wanted to use a 4-hour segment of footage. This would require 8 hours of desktop capture, not acceptable for a productive workflow, but at this early proof-of-concept stage, getting the results is more important than the workflow.

I started running into limitations in the Filmora video editor. Editing with multiple video sources was limited, and I couldn’t export the final production in glorious 2K resolution due to the 1080p limit. Filmora isn’t native to Linux, which is the ecosystem I’m trying to move all my production towards, and its Wine emulation is poor. For the future, I’m looking towards DaVinci Resolve by Blackmagic. This, I assume, is an intermediate video editing application, where Filmora is focused more on entry-level editing.

The idea for the second project using the new dashcam was based around a 4-hour trip. I captured all the media and moved it over to my Windows machine to edit with Filmora. To make the editing process easier, I focused on one source at a time. First, I merged all the dashcam footage into one video. This machine is working with a Q6950 processor, so all rendering had to be done overnight. Once the dashcam footage was one video, it was easily muted, sped up by a factor of ten, then rendered again. This gave me finalized footage I wouldn’t have to edit when piecing all the sources together.
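For anyone who wants to skip the GUI for this step, a command-line pass could do the same mute-and-10x-speed-up; below is a hedged sketch that shells out to ffmpeg from Python (it assumes ffmpeg is installed and on the PATH, and the filenames are placeholders).

import subprocess

# Mute the merged dashcam video and speed it up 10x with ffmpeg.
# setpts=PTS/10 compresses the video timestamps; -an drops the audio track.
subprocess.run([
    "ffmpeg", "-i", "dashcam_merged.mp4",
    "-filter:v", "setpts=PTS/10",
    "-an",
    "dashcam_10x.mp4",
], check=True)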

I then booted up Dashcam Viewer and started the desktop capture of the maps in realtime. This took over 8 hours for both maps. Once the capture was complete, they were put through some quick editing so post-production would just be piecing the sources together. They were sped up 10x and rendered individually with custom resolutions, so they could sit on top of the original footage seamlessly.

The first map was set to “follow” the GPS signal at a small scale. The second map would show a majority of the trip and often the starting point and destination in the same frame. These provided two different perspectives for the footage in case the viewer wants supplementary geographic information.

Syncing all the footage turned out to be more complicated than expected. I originally wanted the final editing procedure to be just piecing together the three sources: the dashcam footage and the two maps. However, the maps were often out of sync with the footage and had to be adjusted manually every few minutes. This led to chopping up the footage, which introduced errors in the maps partway through, thanks to Filmora and operator error.

Post production included adding the music, adding the song information, and fading in and out where appropriate. The final product is not perfect, as there are map errors in the middle of the video and at the end, but I’m happy with how the workflow and the product ended up.

 

In the future, I hope to choose a different editor and see if I can find an additional way to capture and render the maps, with a focus on speed of production. I’d love to find other ways to incorporate GPS information like bearing and speed into the video. Until then, it’s off to add to the ever-growing collection of dashcam footage.

Building an Untangle Box

A few weeks ago I did a quick write-up about the Untangle firewall system and my experience installing and using it on a Protectli Vault all-in-one mini PC. Today I’d like to describe a box I set up as an alternative to the model I previously used, the Protectli Vault. For this box I used an old OptiPlex 780 purchased on Amazon for $87. I’ve been using the OptiPlex 780 as the starting point in a lot of recent projects due to the fact that it’s modular by nature, easily upgradable, and has components that are powerful enough to tackle moderately resource-intensive modern tasks.

20170929_200622_edit

 

The OptiPlex made a great jumping-off point for this project. I wanted an Untangle box with a small form factor so it could be easily incorporated into the physical environment where it would be operating, but not quite as small as the Protectli Vault setup I had used before. I tried to keep the budget around $370, the price of the original Protectli Vault setup, and I wanted to keep the build at least as powerful as the Protectli Vault build.

First I took a look at the RAM. The OptiPlex 780 has 4 dual-channel DDR3 slots onboard. This is more than capable of matching the RAM loadout on the Vault. I was able to find an 8GB kit of two DDR3 1600MHz sticks for $56 on Amazon. These sticks were plenty for what I was building. The 780 came with 4GB of RAM preinstalled, allowing some cost to be recouped. This 4GB might even be enough if the number of services running in the Untangle installation were minimal.

Next was the storage solution. The Vault comes with 120GB of solid state storage, so I figured a 2.5″ SSD would be a suitable match for the OptiPlex. I found a SanDisk 120GB SSD, again on Amazon, for $60. This would provide quick read/write speeds for typical Untangle operation and open up the possibility of using disk space for swap if the need arose. The 780 comes with a hard drive already installed, ranging between 160GB and 250GB. After the SSD installation, these drives can be salvaged for other projects or to recoup some cost.

Arguably the most important part of this particular build is the network interface. The 780 comes equipped with just 1 network interface onboard out of the box. This, by itself, isn’t enough for a functional box: there need to be at least 2 ethernet ports, one for the internal connection and one for the external connection, for the box to function as a firewall. I decided it would be appropriate to incorporate a 4-port 1000Mbps NIC to allow for up to 4 external connections, one-upping the Vault by allowing an additional connection compared to its 3. I purchased the PRO/1000 PT quad-port NIC from Amazon for $56 (now $50) and, in turn, freed up a 4-port switch I had been using to route local traffic, allowing additional cost reclamation by selling this now-redundant equipment. The NIC had to be low-profile to accommodate the reduced room in the small form factor OptiPlex. I also decided to include a single-port card in the spare PCI slot, bringing the number of external ports to an unprecedented 5.

Finally, I wanted to include a beefy quad-core CPU to again one-up the Protectli Vault. The Q9650 was a workhorse Core 2 Quad chip in its day and still packs a wallop. It can hang with newer processing solutions and would be more than enough for this build, theoretically capable of routing over a gigabit of traffic at any time and possibly much more depending on how many local services Untangle is running. I was able to secure one from Amazon for $49. Installing the chip, however, was tricky.

untangle_box_1.jpg

During the install I periodically powered up the build to ease troubleshooting if problems arose. The assembly did turn out to be problematic when I installed the NIC and the new processor. Replacing the CPU was probably the most time-intensive step in the process. It included removing the existing E8500 chip in the OptiPlex, another redundant part that could be sold. The process was made easier by the easily removable heatsink, secured by two screws, and the shroud attached to the heatsink detaches easily from the HDD assembly. Thermal paste was then applied to the new Q9650 and the heatsink was reattached. The system did not boot, and the OptiPlex was showing the error code “2 3 4”, displayed on the lights at the bottom of the front of the chassis. These lights were accompanied by a solid amber light emanating from the power button, indicative of CPU issues.

Troubleshooting was easy enough. I had a spare OptiPlex 780 laying around with identical specs and installed the Q9650 in it after removing it from the Untangle build. Luckily, it booted up, eliminating the possibility that the chip was faulty. I then tried the spare OptiPlex’s chip, another Q9650, in the new build. This attempt also failed to boot, producing the same error indicators for a faulty chip. This confirmed the problem was local to the new build and narrowed it down to the board or some part of the CPU assembly. Luckily, the problem was due to how the heatsink was mounted, so no faulty hardware was involved. I reattached the heatsink by tightening the screws nearest the DVD drive first instead of the opposite order. This pressure difference must have seated the CPU properly, because the machine booted up on the next attempt.

untangle_box_2.jpg

The assembly of all the components was relatively painless apart from the CPU hiccup. With the machine up and running and the software configured, we were off to the races. The physical environment was prepared with a small shelf so the box itself could sit out of the way. It was anchored to the wall using some wire to prevent any nudges from sending it crashing to the floor. The build was officially ordained with an Untangle sticker on the case.

20170930_205716_edit

 

The final price was $308, and with current prices, this total is just below $300, putting us about $70 below budget.

OptiPlex 780 $87

2x4GB DDR3 1600MHz RAM $56

SanDisk 120GB SSD $60

4-Port NIC $56

Q9650 Processor $49

Total: $308

If the micro form factor provided by the Protectli Vault isn’t a necessity, a box with a superior CPU and network solution can be built for around $70 less. This box can handle anything that will be thrown at it in the foreseeable future and is powerful enough to utilize all of the features in the Untangle software suite. In this scenario the OptiPlex once again proves to be an optimal solution.


Building a 50TB Workstation/Server

Collecting data is a passion of mine. I’ve always enjoyed collecting things. I believe the act of collection is a critical component of the human psyche and experience. Maintaining, curating, and growing collections of data is personally and professionally therapeutic and fun. Collecting data and applying it to everyday situations is a critical part of approaching life in the 21st century, and the better your tools, the better your efficacy. Being able to build these tools yourself puts you in even greater control of your data management solutions and opens the door for unique opportunities to engage with interesting cutting-edge technology.

20170910_121556.jpg

Building servers to accomplish the task of storing all the data I’ve collected over the years is a big priority for me. I don’t like to delete things. I don’t like to delete different versions of the same thing. Having hardware that is capable of scaling and storing my ever-expanding repository of documents, movies, music, data, pictures, games, books, and programs is very important to me. I’ve seen the detrimental effects of not having this data easily available this year, with 30TB split between the cloud and a physical box at home, an arrangement that isn’t particularly integrated or useful for my workflow.

Having the data available and easily accessible is only one part of the equation. Security is the second part. Computer operation is always a trade off between convenience and security. When it comes to this bulk storage, I’ve come to the conclusion that my personal needs would be better met by having this server offline. By having this server airgapped, I feel like I would have more control over what is ingested and egressed and would be better situated to deal with malicious threats like ransomware.

The planned server is only one part of the solution. I hope this server can function as a backup and that another server, to be built in the future, will handle all the internet-facing and production activity. This would fulfill the data integrity requirement of an offsite backup, making the data that much more secure in the long run and providing more peace of mind for the administrator.

I decided to run a non-Windows operating system on this machine. I feel like it will require less maintenance, in the form of updates and daily upkeep, as well as eliminate some of the security woes I’ve had in the past with Windows machines. I decided I wanted to utilize the ZFS filesystem for the added control over data integrity and the redundancy operations that are superior to traditional RAID. There is no native ZFS support on Windows. First I looked at OpenIndiana, a Solaris-derived distribution that has ZFS baked in, but I was worried about hardware support and future expandability, so unfortunately it might not be an option. I then looked at FreeNAS, a FreeBSD-based distribution for network attached storage. I wasn’t exactly sure it had the capability under the hood I was looking for in a workstation, and since the box wouldn’t be connected to a network, a lot of its functionality would go unused. FreeNAS was also limited by its user interface: while it has a robust web interface, the local desktop environment is lacking for use as a workstation.

Securing the hard drives was my first concern when setting up this build. A great deal was found in the form of Western Digital Easystore 8TB external hard drives from Best Buy. These external enclosures house WD80EFAX drives that can be easily “shucked” from the enclosure and used for other projects. They hit the shelves at $159.99 apiece, which is about $50 cheaper than the cheapest standalone internal drive on the consumer market. I decided to buy as many of these as I could afford, taking off an extra 10% by opening a Best Buy credit card. This is a storage deal you only see once every few years. These drives do come with some drawbacks.

20170910_203558.jpg

I started mounting the hard drives in the Nanoxia Deep Silence 1 case and realized that the mounting holes were not in the standard position. I was only able to secure two out of four mounting points in the drive trays. This was concerning because drives that can give and move in their enclosures will have shorter lifespans. The case sits vertically, so hopefully gravity will provide the same service as the two missing tray mount points. The 1-year warranty is also something to consider compared to the 3-year warranty on most bare internal drives.

The PSU from a previous build was then put in the tower. Shipped, the Nanoxia DS1 comes with 11 internal 3.5″ slots in the form of two 3-drive cages and one 5-drive cage. One of the 3-drive cages had to be removed for the 750W modular PSU to be installed. This build screams overkill and this PSU is definitely part of that. My reasoning is future-proofing, but it’s also nice to find some use for extra parts lying around. The highest load this machine would experience would likely be several hundred watts less than 750, although all 8 hard drives spinning up at once does create a load that needs to be considered. In addition to installing the PSU, I went ahead and screwed in the motherboard standoffs and did some early wire management to make installation easier.

ryzen

The motherboard was dry-fitted, assembled, and tested outside of the case to prevent any troublesome troubleshooting. The CPU and RAM were both easily seated. The heatsink was dry-fitted to make sure it fit the AM4 socket, despite only saying AM3 on the box. Thermal paste was then applied to the processor and spread to an even coat with a piece of cardboard before the heatsink was applied for a final time. The 2 sticks of 8GB RAM were double-checked to make sure the proper dual-channel slots were being utilized; the slots are staggered on this board.

20170910_144746.jpg

Installing the M.2 SSD was interesting to do for the first time. I had never had the pleasure of working with one before. The motherboard includes a special standoff for the M.2 SSD and a screw to secure it in place.

20170910_150835.jpg

After everything was installed it was time to power on the motherboard assembly. This was done outside of the case on a static-resistant material first. The PSU needs to power the main 24-pin motherboard connector, the 8-pin CPU power, and at least the power switch on the case. At first it didn’t display. Luckily the B350 motherboard comes with 4 debugging lights which indicate what component is preventing the system from posting.

20170910_155039.jpg

The GPU debug light was on and I did a quick facepalm. I had forgotten that Ryzen chips of this series do not include integrated graphics and need a discrete graphics card in order to display. Luckily I was able to cannibalize a GT 1030 from another computer I had laying around. There is a FirePro W4100 on the way for another project that might have to be adopted for this build, but the 1030 will do for now. Definitely something to consider; I might not have bought this Ryzen had I foreseen the cost of a discrete video card. I’m still satisfied with my purchase so far. $300 for 8 cores is a great deal no matter how you slice it. If I decide to keep using the GT 1030, I will need to get a full-profile bracket so it sits flush with the slots on the back of the machine.

20170910_163921.jpg

With the motherboard posted and fitted, the IO shield was installed on the back of the case. Wires in the case were further arranged for management later. The DVD optical drive was hooked up, and FreeNAS was booted up to try out an OS. The system booted fine into the operating system after installation, which is always nice. Installing the OS to the M.2 SSD was humorously fast. I decided to switch over to Ubuntu after seeing FreeNAS’s lack of a desktop environment. OpenIndiana, my other choice, needed some Solaris shell knowledge that I was not particularly in the mood to figure out. “Just Working”™ is something I look for in an OS, and Ubuntu should support everything out of the box, has a DE, and can run ZFS.

I then encrypted the disk and the home folder. These are two basic hardening steps for the OS, and Ubuntu offers to perform them during the installation process. With these two encryption options, no one will be able to boot into the system using a rescue CD, DVD, or USB without the password. The M.2 SSD allows this constant encryption work to be transparent and almost unnoticeable thanks to its roughly 3GB/s read speeds, something that might bottleneck performance on other drive technologies. The speed of this little device is shocking. An install that can take as long as fifteen minutes was done in less than three, including the time-intensive encryption operations. This is a fantastic form factor that makes SATA SSDs seem like they crawl.

20170910_202754.jpg

After the basics were up and functioning it was time to connect everything on the board: audio ports, USB ports, HDD lights, power lights, reset switch, fans. The SAS controller card went in next, followed by the HDD array. The SAS card booted up properly the first time and occupies the second PCIe x16 slot on the motherboard. I decided it would be best to install the drives one at a time. This way I could erase the preinstalled partition left over from the WD Easystore software, label the drives, and test them individually. Another issue arose over the form factor of these drives. They would not clear the back of the cage, which only allowed one side of the clips to secure the drive in place, further adding to the instability problems. It would be possible to alleviate this by modifying the cages themselves, but that is not something I wanted to jump straight into. After everything was checked out and noted it was time to install the ZFS filesystem.

20170911_003750.jpg

ZFS has to be downloaded from the Ubuntu repository. I wanted to create a whitelist that only allowed communication from the server to the Ubuntu repository, but messing with iptables was not providing the URL-based functionality I was used to with other solutions like Untangle, so I decided it was easier to deal with it on the hardware firewall later. sudo apt-get install zfs is all it took to get the filesystem utility ready to operate. I still need to explore ZFS as a system. This server will give me a platform to experiment before I bring the 25TB of data down from the Amazon cloud.

The wiring for the drives was an extremely tight fit. There was not enough room for the cable management I wanted to perform. The side panel was barely able to latch into place, and even then the panel was bulging where the wires were most crowded. Most of the slack wiring is on the open side of the case. A possible mod would be cutting a hole where the crowding occurs and installing some kind of distended chamber for the excess wiring. This is something to consider in the future.

Below is the list of parts and the link to this list on PC part picker.

PC part picker link

screenshot-pcpartpicker.com-2017-09-17-14-32-33.png

There are definitely some things I want to handle with this project in the near future. The case either needs to be modified to allow more cable room, or the drives need to be refitted so they dump cables into the front side of the case. This might also alleviate the crowding against the drive cages.

I want to find a good use for the Ryzen 7. Video capture was one of the first things that came to mind. I’d like to include a capture card in this build; having a second system to capture video greatly increases the intensity of operation that can be done on a primary machine without the processing overhead of recording on the same machine.

I need to install the 2 hard drive hot-swap bays. This will fill up all the remaining 5.25″ slots on the case. Having two hot-swap bays makes the ingest process easier, allowing two drives to be ingested or egressed at once as well as duplication operations.

I’d like to investigate additional uses for the build. It hasn’t been completely put into production, so the finer details of operation are still up in the air. This build was one of the most powerful machines I’ve ever had the opportunity to put together. I can’t wait to begin sorting and curating the data on this machine and expanding its functionality in the future. Hopefully “Ratnest” has many years of hoarding data ahead of it.

Update:

After rereading this post I realized I forgot to mention the 50TB total storage. 8 x 8TB is 64TB of raw storage. This shrinks to roughly 47TB usable when using ZFS with 1 drive of redundancy.
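For anyone checking that math, the sketch below walks through it: one drive’s worth of parity leaves 7 x 8 TB, and converting the decimal terabytes printed on the box into the binary units ZFS reports, minus some filesystem overhead, lands in the high-40s.

# Rough usable-capacity math for 8 x 8 TB drives with single-drive redundancy (raidz1).
drives, size_tb, parity = 8, 8, 1

raw_tb = drives * size_tb                           # 64 TB of raw storage
after_parity_tb = (drives - parity) * size_tb       # 56 TB once one drive goes to parity
after_parity_tib = after_parity_tb * 1e12 / 2**40   # decimal TB -> binary TiB, about 50.9

print("Raw: %d TB, after parity: %d TB (about %.1f TiB before ZFS overhead)"
      % (raw_tb, after_parity_tb, after_parity_tib))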

By reversing the direction of the drives in the cage, I was able to route the cords in a manner that allowed the sides to fit on the case. This mounting technique allowed the drives to clear the back of the cage, alleviating the need for case modification, always a plus when it is not completely necessary.

 

Being able to situate these drives in the case and close it without having a visible bulge in the side panel effectively completed this build. It is now operational and should provide enough storage for all the data I’m ingesting for the next couple of years at least.

All the dense drives made this the heaviest build I’ve ever constructed, weighing in at almost 50 pounds, a pound for every terabyte.
20171003_122153_edit

Here’s to hoping for a successful archival workflow in “Ratnest’s” future.


20171004_162224_edit

Writing a RAID Calculator in Python

RAIDr is a RAID calculator written in Python. It accepts input from the user and calculates how a certain configuration of hard drives will be allocated across different RAID configurations. This is the first program I ever wrote and the project that got me interested in programming. It’s not the most efficient, and there are some alternate ways to approach this problem, but I’m happy with how the product turned out. It is still incomplete, but hopefully someone can find it useful. The code is commented with thoughts on how it should function and things that need to be done. I’m not a professional Python programmer, and my methodology might not be completely “pythonic”, but this was a great project for me to gain exposure to programming and syntactical logic. Any constructive criticism is welcomed.

## Josh Dean
## RAIDr
## Created: 2/14/2017
## Last Edit: 3/21/2017
## Known bugs:

## Global Declarations

hddnumvar = 0
hddsizevar = 0
raidvar = 0
hddwritevar = 0 ## used to mitigate reference error in RAID 10 calculation

## Functions

def hdd_num():
	global hddnumvar
	print ("\nHow many drives are in the array?") ## eventual formatting errors will come from here
	hddnumvar = input() ## necessary if variable is global?
	if hddnumvar == 1:
		print ("Error: Can't create a RAID with 1 disk.")
		hdd_num()
	elif hddnumvar > 1:
		print hddnumvar, "drives in the array"
		print "----------------------- \n"
	else:
		print ("I don't know what you entered but it's incorrect.")

def hdd_size(): ##needs error parsing
	global hddsizevar
	print ("What is the capacity of the drives? (gigabyte)")
	hddsizevar = input() ## possible to use line break with input
	print hddsizevar, "raw GiB per disk"
	print "----------------------- \n"
	print("%s drives in the array of %s GiB each.") % (hddnumvar, hddsizevar) ##there was a return value here, implication, seems to be hanging here with a syntax error?
	##removed the % format for something else, seems to be working; single quotations critical for functional syntax, fixed it by including the arguments in parentheses
	raid_prompt()

def raid_prompt(): ##update this to reflect actual raid configurations, calls raid_calculation, all edits and calls should start here
	print ("\n1 - RAID 0")
	print ("2 - RAID 1")
	print ("3 - RAID 5")
	print ("4 - RAID 5E")
	print ("5 - RAID 5EE")
	print ("6 - RAID 6")
	print ("7 - RAID 10")
	print ("8 - RAID 50")
	print ("9 - RAID 60 \n")
	raidvar = input("What raid configuration? \n")
	raid_calculation(raidvar)

def raid_calculation(raidvar): ## just handles the menu
	if raidvar == 1:
		hddtotal = hddsizevar * hddnumvar ## variables need to go first
		print "\n-----------------------" ## /n doesn't need a space to seperate, bad formatting, best to put this in front
		print ("RAID 0 - Striped Volume")
		print hddnumvar, "drives in the array"
		print hddsizevar, "raw GiB in the array per disk"
		print "%s raw GiB in the array total" % hddtotal
		print "Total of", hddnumvar * hddsizevar, "GiB in the RAID array." ## this need alternative wording throughout the program
		print "%s times write speed" % hddnumvar ## Can I put these two prints on one line? Multiple % variables?
		print "%s times read speed" % hddnumvar
		print "No redundancy"
		print "No hot spare"
		print "----------------------- \n"
	elif raidvar == 2:
		print "\n-----------------------"
		print ("RAID 1 - Mirrored Volume")
		print hddnumvar, "drives in the array"
		print hddsizevar, "raw GiB per disk"
		print "Total of", hddsizevar, "GiB in the array."
		print "%s times read speed" % hddnumvar
		print "No write speed increase"
		hddredunvar = hddnumvar - 1
		print "%s disk redundancy" % hddredunvar
		print "No hot spare"
		print "----------------------- \n"
	elif raidvar == 3:
		if hddnumvar < 3:
			print "\nYou need at least 3 disks to utilize Raid 5"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 5 - Parity")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 1) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 1
			print "%s times read speed" % hddreadvar
			print "No write speed increase"
			print "1 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar == 4:
		if hddnumvar < 4:
			print "\nYou need at least 4 disks to utilize RAID 5E\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 5E - Parity + Spare")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 1
			print "%s times read speed" % hddreadvar
			print "No write speed increase"
			print "1 disk redundancy"
			print "1 hot spare"
			print "----------------------- \n"
	elif raidvar == 5:
		if hddnumvar < 4:
			print "\nYou need at least 4 disks to utilize RAID 5EE\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 5EE - Parity + Spare")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 2
			print "%s times read speed" % hddreadvar
			print "No write speed increase"
			print "1 disk redundancy"
			print "2 hot spare"
			print "----------------------- \n"
	elif raidvar == 6:
		if hddnumvar < 4:
			print "\nYou need at least 4 disks to utilize RAID 6\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 6 - Double Parity")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 2
			print "%s times read speed" % hddreadvar
			print "No write speed increase"
			print "2 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar == 7:
		if hddnumvar < 4:
			print "\nYou need at least 4 disks to utilize RAID 10\n"
			disk_num_prompt()
		elif (hddnumvar % 2 == 1):
			print "\nYou need an even number of disks to utilize RAID 10\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 10 - Stripe + Mirror")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar / 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar / 2 ## actual write variable calculation
			print "%s times read speed" % hddnumvar
			print "%s write speed increase" % hddreadvar
			print "At least 1 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar == 8: ## bookmark, need formulas
		if hddnumvar < 6:
			print "\nYou need at least 6 disks to utilize RAID 50\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 50 - Parity + Stripe")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 2) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 2
			##print "%s times read speed" % hddreadvar
			##print "No write speed increase" # Although overall read/write performance is highly dependent on a number of factors, RAID 50 should provide better write performance than RAID 5 alone.
			print "2 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar == 9:
		if hddnumvar < 8:
			print"\nYou need at least 6 disks to utilize RAID 50\n"
			disk_num_prompt()
		else:
			print "\n-----------------------"
			print ("RAID 60 - Double Parity + Stripe")
			print hddnumvar, "drives in the array"
			print hddsizevar, "raw GiB per disk"
			print "Total of", (hddnumvar - 4) * hddsizevar, "GiB in the array."
			hddreadvar = hddnumvar - 2
			##print "%s times read speed" % hddreadvar
			##print "No write speed increase" # Although overall read/write performance is highly dependent on a number of factors, RAID 50 should provide better write performance than RAID 5 alone.
			print "2 disk redundancy"
			print "No hot spare"
			print "----------------------- \n"
	elif raidvar > 9:
		print ("Error: Please select a number between 1 and 9")
		raid_prompt()
	elif raidvar == 0: ## additional error parsing required here
		print ("Error: Please select a number between 1 and 9")
		menu_prompt() ## ubiquitous for all loop items that aren't errors

def disk_num_prompt(): ## this will eventually need to accept arguments that are context sensitive for raid type and disk requirements, perhaps handle this in the raid_calculation function
	global hddnumvar
	print "Adjust number of disks?"
	print "1 - Yes"
	print "2 - No"
	disknummenuvar = input()
	if disknummenuvar == 1:
		hddnumvar = input("\nHow many drives are in the array? \n")
		if hddnumvar == 1:
			print "Error: Can't create a RAID with 1 disk."
			hdd_num()
		elif hddnumvar > 1:
			print "\nUpdated"
			print hddnumvar, "drives in the array" ## displays once for every loop, hdd_num_input for mitigation
			raid_prompt()
		else:
			print ("I don't know what you entered but it's incorrect.")
			disk_num_prompt()
	elif disknummenuvar == 2:
		raid_prompt()
	else:
		print("I don't know what you entered but it's incorrect.")
		disk_num_prompt()

#below is the menu for the end of the selected operations
def menu_prompt(): ## need additional option to go to GiB to GB converter?
	print "1 - RAID menu"
	print "2 - Quit"
	print "3 - Start Over"
	menu = input()
	if menu == 1:
		raid_prompt() ## looping, quit() function is ending script, will need revision
	elif menu == 2:
		print "Cya"
		quit()
	elif menu == 3:
		start()
	elif menu == 0:
		print "Error: Please select 1, 2, or 3 \n"
		menu_prompt()
	elif menu > 3:
		print "Error: Please select 1, 2, or 3 \n"
		menu_prompt()
	else:
		print "quit fucking around" ##formatting

def data_transfer_jumpoff(): ##BOOKMARK
	print "What is the transfer speed? Gigabytes, please"
	transfervar = input()
	print"What denominator of data size?"
	print"1 - byte"
	print"2 - kilobyte"
	print"3 - megabyte"
	print"4 - gigiabyte"
	print"5 - terabyte"
	transferunit = input()
	print"How much data?"
	transferamount = input()

## Start Prompt, this needs to be expanded upon
def start(): ##easier way to reset all these variables?
	startmenu = 0
	raid_var = 0 ## should be an inline solution for this in its own function, it just works
	hddnumvar = 0
	hddsizevar = 0
	raidvar = 0
	print("\nChoose an operation") ## line break might cause formatting errors look here first
	print "1 - RAID calculator"
	print "2 - Data Transfer Calculator"
	startmenu = input()
	if startmenu == 1:
		hdd_num() ## these need to be called in a more functional manner
		hdd_size()
	elif startmenu == 2:
		data_transfer_jumpoff()
	else:
		print "Not supported\n" ## will require edit
		start()

#main scripting

start()

 

The operation of the program begins by calling the start function. I put this function call at the bottom of the script so it would be easily accessible; start() is the last function defined before the initial call. From this menu the user is asked which of the two currently implemented operations they wish to perform: the RAID calculator or the Data Transfer Calculator. The Data Transfer Calculator is still a work in progress.

system-hdd-raid.jpg

 

When the RAID calculator is selected, the user is queried about the number of hard drives and their capacity through two functions: hdd_num and hdd_size. These functions are called several times during a session of the program, so I thought it would be appropriate to make them their own functions. They read input from the user and set the appropriate variables for use during the calculation.

Next, the available RAID formats are listed, and the user chooses which calculation they want to perform on the previously set variables. In this version of the code, RAID 0 through RAID 10 work fine, but the functions for RAID 50 and 60 are missing their full capacity calculations since the formulas are not as straightforward. Once the selection is made, the results of the calculations are displayed.
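For reference, a hedged sketch of how those missing RAID 50/60 capacity formulas might look; both levels stripe across RAID 5 or RAID 6 sub-arrays, so the result depends on how many groups the disks are split into (the function names and the default of two groups are my assumptions, not part of RAIDr).

# Usable capacity for the nested RAID levels, parameterized by the number of sub-arrays.
def raid50_capacity(disks, disk_size, groups=2):
    # RAID 50: each RAID 5 group gives one disk to parity.
    assert disks % groups == 0 and disks // groups >= 3
    return (disks - groups) * disk_size

def raid60_capacity(disks, disk_size, groups=2):
    # RAID 60: each RAID 6 group gives two disks to parity.
    assert disks % groups == 0 and disks // groups >= 4
    return (disks - 2 * groups) * disk_size

print(raid50_capacity(6, 4000))   # six 4000 GiB disks in two groups -> 16000 GiB
print(raid60_capacity(8, 4000))   # eight 4000 GiB disks in two groups -> 16000 GiB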

At the end of the operation, users are presented with several options. They can change the variables and recalculate, or change the RAID calculation itself. The main menu can also be called to perform a data transfer calculation in the future. It might be beneficial to pass the size of the array along and calculate the transfer time by just asking the user for the connection speed; this data could then be appended to the RAID results. It might also be beneficial to include a memory function that remembers specific RAID configurations and writes them to a text file that can be loaded on subsequent runs.
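A minimal sketch of what that transfer calculation might look like, assuming the array size is passed along in GiB and the connection speed is given in megabits per second (names are illustrative, not part of the current script):

# Estimate how long copying a full array would take over a given link.
def transfer_hours(array_gib, link_mbps):
    bits = array_gib * 2**30 * 8            # GiB -> bits
    return bits / (link_mbps * 1e6) / 3600

print("%.1f hours" % transfer_hours(16000, 1000))  # 16000 GiB over gigabit: about 38 hours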

Another function that might be useful is the reconciliation of GiB values and GB values. This would help if users are using an NTFS file system. It might also be useful to include other filesystem types in the calculations to get the most accurate numbers possible and maximum compatibility.
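That reconciliation boils down to the gap between decimal gigabytes (10^9 bytes) and the binary gibibytes an operating system reports; a quick sketch of helpers that could be added (the function names are mine):

# Convert between manufacturer-advertised decimal GB and the binary GiB an OS reports.
def gb_to_gib(gb):
    return gb * 1e9 / 2**30

def gib_to_gb(gib):
    return gib * 2**30 / 1e9

print("%.1f GiB" % gb_to_gib(1000))   # a "1 TB" (1000 GB) drive shows up as ~931.3 GiB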

Again, this was fun to make and I find myself using it from time to time. There is still a lot of work to do before the program can stand on its own. Taking user input comes with an interesting set of problems that could allow certain inputs to change the behavior of the program. If the user isn’t intentionally trying to break the program this shouldn’t be an issue; the instructions and commands are very clear when user input is necessary. There is also some formulaic work to be done with the two newest RAID formats.
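Most of those input problems trace back to Python 2's input(), which evaluates whatever the user types as an expression. A hedged sketch of a safer prompt, written against Python 2 to match the script (on Python 3 the raw_input call would simply become input()):

# Safer numeric prompt: parse the text ourselves instead of evaluating it.
def ask_int(prompt, minimum=1, maximum=None):
    while True:
        text = raw_input(prompt)  # on Python 3 this would be input()
        try:
            value = int(text)
        except ValueError:
            print "Error: please enter a whole number."
            continue
        if value < minimum or (maximum is not None and value > maximum):
            print "Error: that number is out of range."
            continue
        return value

hddnumvar = ask_int("\nHow many drives are in the array? \n", minimum=2)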

Python was a great language for me to grasp the beginning intricacies of programming, and I feel like even more intricate programs are now within reach. Combining the operating structure of something like RAIDr with GIS functions like those illustrated below would allow easy semi-automatic scripting of tasks. The sky is once again the limit.


Mapping Botnets

A botnet is a collection of compromised computers controlled by an individual or a group of malicious actors. You may have heard the term “owned” thrown around online or in gaming communities. The origin of the word comes from taking control of another computer. If I get you to run a Trojan that places a backdoor on your computer, leverage this backdoor to escalate my privileges on the system, then de-escalate your privileges, you no longer own your computer. Your computer is “owned” and administered by someone else, likely remotely. A botnet is just a collection of these compromised machines which work in tandem: commanding and controlling each other, pushing updates, mining cryptocurrency, running port scans, launching DDOS attacks, proxying an attacker’s connection, and hosting payloads for attacks.

The larger the botnet the better, but other features like connection speed, hardware, and position in the network topology ultimately define a computer’s usefulness in a botnet. Don’t get me wrong, a hacker is not going to be picky about the computers that are added. There is a job for every piece of hardware.

The larger the botnet, the more attention it draws. The larger botnets can also be leased out on black markets for attacks and other malicious activity. There is a sweet spot where a botnet is powerful enough to be leveraged but low-key enough to fly under the radar. Flying under the radar is something Mirai, a botnet from late 2016, did relatively well. Mirai took control of IoT (Internet of Things) devices with weak passwords. These devices included TV boxes, closed-circuit TV equipment, home thermostats, and other “things”. These devices are set up to run without administration, so once they were owned by an attacker, they were likely to remain owned and under the radar. vDos, a DDoS-for-hire service, was one of the largest such operations; its owners were arrested in Israel in 2016.

I’ve been dealing with what I suspect to be a botnet on my home network. I got lucky the other day after installing a home firewall. After blocking a suspect connection, I was swarmed with thousands of attempted sessions from all over the world. My working theory is that this is a botnet using P2P networking for its command and control infrastructure, and it was trying to see where the computer it had lost contact with went. I was able to export this 5-minute period of connections to a CSV file and plot it in ArcMap. The following map is what was produced.
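Outside of ArcMap, the same CSV export can be thrown onto a rough scatter plot in a few lines of Python; the sketch below assumes the export contains latitude and longitude columns, and the column names used here are guesses rather than Untangle's actual field names.

import csv
import matplotlib.pyplot as plt

# Plot the exported session coordinates; the column names here are guesses.
lats, lons = [], []
with open("suspect_sessions.csv") as f:
    for row in csv.DictReader(f):
        try:
            lats.append(float(row["client_latitude"]))
            lons.append(float(row["client_longitude"]))
        except (KeyError, ValueError):
            continue  # skip sessions without usable coordinates

plt.scatter(lons, lats, s=4, alpha=0.5)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Suspected botnet sessions")
plt.savefig("botnet_scatter.png", dpi=200)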

 

botnet activity 1.png

I’m a firm believer that every problem should be approached, at least hypothetically, through a geographic perspective. By putting this data on a map, an additional perspective is provided that can be analyzed. Looking at this map for the first time was also surprisingly emotional for me. I have been chasing this ghost through the wire since December 2016 and, through the geographic perspective, was finally able to see and size up the possible culprit.

I had to filter the United States out of the dataset because I was running an upload to an Amazon web service, which would have added unrelated coordinates in the United States and skewed the data. This data would later be parsed and included.

Immediately I was drawn to the huge cluster in Europe. If this is truly the botnet I’ve been looking for, Europe would be a good place to start looking. There were 7000 sessions used in the dataset. I’m grateful that the Untangle firewall includes longitude and latitude coordinates in the data it produces. This made the data migration easy and painless.

I got lucky again two weeks later when I got another swarm of sessions from what I assume to be the same botnet. This was, again, after I terminated a suspect connection, suggesting that this experiment is repeatable, which would provide an avenue for reliable data collection. I then took to the new ArcGIS Pro 2.0 to plot some more maps. With 2 sets of data, analysis could be taken to the next level through comparison.

 

2017-08-15.png
Full Resolution

 

First I have to say that this new ArcGIS interface is beautiful. It’s reminiscent of Microsoft Office due to the ribbon toolbar layout. I found the adjustment period quick, and the capability is expanded compared to earlier versions and standalone ArcMap. After using ArcMap I was surprised to see how responsive and non-frustrating this was to use. I ran ArcMap on a bleeding-edge system with 16GB of RAM and saw substantial slowdown; I was able to run this suite on an old OptiPlex system with 4GB of RAM with no noticeable slowdown. It is truly a pleasure to work with.

 

botnet_activity_8_13_17_small.png
Full Resolution

 

Using the second set of data I was able to produce the map above. I went ahead and created a huge resolution image so I could see the finer geographic details of the entities involved. This dataset includes the United States because I wasn’t running any background processes at the time the data was collected. I can safely assume this map represents only suspected botnet connections. I was glad to see a similar distribution, with Europe continuing to produce the majority of the connections. The real fun begins when we combine these two datasets but first let’s take a moment to look over the patterns in the above map.

Just looking at a glance, we can see there is a disproportionate number of connections originating in Europe. There seem to be 4 discernible areas of concentration in Europe: the United Kingdom, the Benelux region, the Balkans, and Moscow. Looking at the United States, we see a majority of connections coming from the Northeast, and across the Saint Lawrence in Canada. Florida is represented, as are the Bay Area and Los Angeles. Vancouver, Canada seems to have a strong representation. Connections in South America are concentrated along the mouth of the Río de la Plata, where the major population centers are, and the coast of Brazil. A lot of South American tech operations happen in this region, so if there were compromised computers on the continent, this would be an appropriate area to find them.

China seems to be underrepresented. The last network security maps I made were overwhelmingly populated by Chinese IPs, yet this map seems to feature only Beijing of the three major coastal cities. The Korean peninsula seems to have a strong representation. Central and Southern Asia are not strongly represented except for India, which, like China, seems underrepresented considering the population and the number of internet-connected devices in the country.

It turns out Singapore is a large player in the network. However, it’s not inherently apparent given Singapore’s small footprint. These point maps don’t properly represent the volume of connections for some areas where many connections originate from a small area. By using heatmaps we can combine the spatial and volume elements in an interesting way.
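One lightweight way to get that volume-aware view outside of a GIS package is to bin the coordinates into a 2D histogram; the sketch below reuses the same hypothetical CSV export and column names as the scatter example earlier, so dense sources like Singapore stand out by count rather than footprint.

import csv
import numpy as np
import matplotlib.pyplot as plt

# Re-read the exported coordinates (same hypothetical columns as the scatter sketch).
lons, lats = [], []
with open("suspect_sessions.csv") as f:
    for row in csv.DictReader(f):
        try:
            lons.append(float(row["client_longitude"]))
            lats.append(float(row["client_latitude"]))
        except (KeyError, ValueError):
            continue

# Bin the sessions into a coarse grid so dense sources stand out by volume, not footprint.
heat, xedges, yedges = np.histogram2d(lons, lats, bins=(180, 90))
plt.imshow(heat.T, origin="lower",
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]],
           cmap="hot", aspect="auto")
plt.colorbar(label="Sessions per cell")
plt.savefig("botnet_heatmap.png", dpi=200)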

Next we’ll look at the combination of these two point databases.

 

botnet_activity_both_days_7_31_17_top_small.png

 

I included the lower resolution map above so the points could be easily seen. A level of detail is lost but it allows it to be easily embedded in resolution sensitive media like this webpage.

The idea here was that, since a majority of the points overlap, a comparison of changes could be made across this two-week period. I parsed out the United States data from the first dataset so it could be included and compared. By focusing on which dataset is layered on top, we can infer which computers were removed from the botnet, either through being cleaned up or going offline, and which computers were added to the botnet in this two-week period. I’m operating under the assumption that this is a P2P botnet, so any membership queries are being performed by almost every entity in the system. I’m also assuming this data represents the entirety of the botnet.

When we put the original dataset created on 7-31-17 above the layer containing the activity on 8-13-17 we’re presented with an opportunity for temporal as well as spatial analysis.

 

botnet_activity_both_days_7_31_17_top.png
Full Resolution

 

By putting the 7-31-17 dataset on top, we're presented with a temporal perspective in addition to the geographic one. Visible purple dots are not included in the first dataset; otherwise they would be covered by a green dot. These visible purple dots indicate machines that have presumably been added to the botnet. With more datasets it would be possible to track the growth of these networks.

botnet_activity_both_days_8_13_17_top_small.png

Above is the same data re-layered with the 8-13-17 dataset on top. The temporal perspective shifts when we change the ordering. Visible green dots from the first set may indicate machines that were no longer part of the botnet when the second dataset was created. Machines leaving a botnet is plausible, but it's also possible those machines were simply offline or unable to establish a session. It's also possible that, even with a P2P networking scheme, not every machine in the botnet contacts every other machine, so a host can appear to drop out without actually leaving; having every node visible to every other node would seem like a serious operational security error by the botnet operator. Then again, it's entirely possible they're not trying to cover their tracks and are employing a "spray and pray" tactic, running the botnet at full capability and not worrying about the consequences. A full resolution image is linked in the caption.

By looking at both sets under the assumption that the entire botnet revealed itself, we can see whether the botnet is growing or shrinking. If there are more visible purple dots on the map where green dots are layered on top than visible green dots on the map where purple dots are layered on top, the botnet is growing. If the opposite is true, the botnet is shrinking.
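The same check can be made numerically. Below is a minimal sketch, assuming each day's connections were exported to a CSV with an "ip" column; the filenames and the column name are assumptions, not the actual exports.

import csv

# Minimal sketch: compare two (hypothetical) daily exports of suspected
# botnet connections and count additions and removals.
def load_ips(path):
	# Read the set of unique source IPs from a CSV export.
	with open(path) as f:
		return {row["ip"] for row in csv.DictReader(f)}

day1 = load_ips("botnet_7_31_17.csv")
day2 = load_ips("botnet_8_13_17.csv")

added = day2 - day1      # the visible purple dots: presumed new members
removed = day1 - day2    # the visible green dots: members gone quiet

print("Added: %d, removed: %d" % (len(added), len(removed)))
if len(added) > len(removed):
	print("The botnet appears to be growing.")
else:
	print("The botnet appears to be shrinking or holding steady.")

Counting the set differences is the numeric equivalent of counting the visible purple and green dots on the layered maps.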

botnet_activity_both_days_8_13_17_top.png
Full Resolution

 

The most interesting feature of these comparison maps is the predilection for certain countries and regions. Looking at the rotation of computers in and out of the botnet, the Northeast United States and Florida stand out as hotspots. The reason is not clear, but it serves as a starting point for additional research. It's important to remember that data reflects population: major cities all show signs of activity, so genuine concentrations are best identified empirically by normalizing for population. The activity also seems to proliferate from areas where it is already established; perhaps some kind of localized worm activity is used for propagation.
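As a rough illustration of the normalization idea, here is a sketch with made-up connection counts and approximate national populations; none of these figures come from the datasets above.

# Rough sketch of normalizing connection counts by population.
# The connection counts are invented; populations are approximate, in millions.
connections = {"United States": 140, "Netherlands": 95, "Hungary": 60}
population_millions = {"United States": 323.0, "Netherlands": 17.0, "Hungary": 9.8}

for country, count in connections.items():
	rate = count / population_millions[country]
	print("%s: %.1f suspected connections per million residents" % (country, rate))

Now let's take a look at the real elephant in the room: Europe.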

 

botnet_activity_both_days_8_13_17_top_europe extent_marked.png

 

The majority of machines seem to be in Europe, and certain regions show concentrated activity. They are marked in red above; from left to right: the UK, the Netherlands, and Hungary. There are also concentrations in Switzerland, Northern Italy, Romania, and Bulgaria.

The main three concentrations pose interesting questions. Why is there so much activity in the UK? The Netherlands concentration can be explained by the number of commercial datacenters and VPS operations; a lot of for-rent services operate out of the Netherlands, making it a regular sight in IP address lookups. Hungary is an interesting and befuddling find, since there is no dominant information systems industry there like in the Netherlands. What do all these countries have in common? Why are the concentrations so specific to borders? Answering these questions will be critical to solving the mystery. Next we'll try our hand at some spatial analysis.

 

botnet_activity_7_31_17_heatmap_std_deviation_small.png

 

A kernel density map, also known as a heatmap, shows the volume of data in geographic space. This is an appropriate spatial analysis to run alongside the point map because it reveals the volume of connections that may be buried under one point. If one point initiates 100 sessions, it’s still represented as one point. These heatmaps reveal spatial perspectives that the point maps cannot.
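As a small illustration of that point (this uses scipy, not ArcGIS's kernel density tool, and the coordinates are hypothetical), one hundred sessions stacked on a single coordinate dominate the density surface even though they draw as one dot on the point map:

import numpy as np
from scipy.stats import gaussian_kde

# One hundred sessions at a single coordinate plus three scattered points.
lon = np.concatenate([np.full(100, 103.8), np.array([4.9, -0.1, 19.0])])
lat = np.concatenate([np.full(100, 1.35), np.array([52.4, 51.5, 47.5])])

# Estimate the density surface and sample it at two locations.
kde = gaussian_kde(np.vstack([lon, lat]))
for x, y, label in [(103.8, 1.35, "stacked point"), (4.9, 52.4, "single point")]:
	print("%s density: %.4f" % (label, kde([[x], [y]])[0]))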

 

botnet_activity_7_31_17_heatmap_std_deviation_large.png
Full Resolution

 

Immediately we see some interesting volumes that were hidden in the point map. Moscow lights up in this representation; the tight circular pattern indicates that many connections came from a small geographic area. By using standard deviation to classify the data, the biggest players show up in red. There is a strong showing in Toronto, Canada that wasn't apparent on the other maps, and our focus areas of the UK and the Netherlands are well represented. Peripheral areas like Northern France and Western Germany also light up, suggesting concentrated activity, perhaps in the large metro areas. Seoul, South Korea lights up, suggesting large volumes of connections, and there is notable activity in Tokyo. As I mentioned before, Singapore stands out on this map. Singapore is a small city-state on the tip of peninsular Malaysia along the Strait of Malacca, so individual connections there are difficult to distinguish on a point map given the city's small footprint. This raises a peculiar question: why is this botnet so particular about boundaries? Singapore is crawling with connections, but neighboring Malaysia, which may share some of the same internet infrastructure, is quiet on the heatmap.

 

botnet_activity_8_13_17_heatmap_std_deviation_small.png

 

As with the other maps, I created a small and a full-resolution version. For these kernel density maps there are several options for representing the data; I chose standard deviation and geometric interval classifications. Each provides a unique perspective, and every additional perspective might reveal something new. The geometric map "smooths" the distribution of the data, showing areas that might not have been significant enough to appear in the standard deviation representation.
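For a rough sense of how a standard deviation classification carves up the density values, here is a sketch on simulated numbers; ArcGIS computes its own breaks internally, so this only illustrates the general idea.

import numpy as np

# Simulated, heavily skewed "density" values standing in for raster cells.
density = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)
mu, sigma = density.mean(), density.std()

# Class breaks at one standard deviation steps around the mean.
breaks = [mu - sigma, mu, mu + sigma, mu + 2 * sigma]
labels = ["well below mean", "below mean", "above mean", "1-2 std dev above", "more than 2 std dev above"]

bins = np.digitize(density, breaks)
counts = np.bincount(bins, minlength=len(labels))
for label, count in zip(labels, counts):
	print("%-25s %d cells" % (label, count))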

 

botnet_activity_8_13_17_heatmap_std_deviation_large.png
Full Resolution

 

In the future it might be beneficial to aggregate by country borders and make a choropleth map showing the number of sessions per country. This would reveal countries where many sessions come from the same coordinates.
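A minimal sketch of the tallying step that would feed such a choropleth, assuming the session records were exported to a CSV with a country field (the filename and field name are assumptions):

import csv
from collections import Counter

# Count sessions per country; the counts would then be joined to a
# country boundary layer to symbolize the choropleth.
with open("botnet_sessions.csv") as f:
	counts = Counter(row["country"] for row in csv.DictReader(f))

for country, sessions in counts.most_common(10):
	print("%s: %d sessions" % (country, sessions))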

It might also be beneficial to parse the data further and add appropriate symbology, with additional maps for the points present in both sets as well as the points unique to each set. This set of three maps would present the data in an additional spatial context, providing another perspective for analysis.

As always, I will be on the hunt for additional data. The next step for this project is finding the conditions for reproducing this swarm of connections. If it turns out to be easily reproducible, the real fun begins: additional data would be collected at regular intervals and mapped accordingly. With more data comes more insight. Automating the data collection and mapping would be the final step. At some point the geographic patterns will be so apparent that the next steps become clear.

Until then I’m still on the warpath. Never has research been so personal to me.

i-will-find-you.gif

Imgur Album


 

Working with GIS and Python

Python is a powerful scripting language that lets users script repetitive tasks and automate system behaviors. Unlike compiled languages such as C++ and Java, Python is interpreted, which changes how programs are developed and run. Python's scripting syntax might also ease the learning curve for those new to programming concepts, and GIS is a great introduction to Python programming. Please excuse any formatting errors; some indentation was lost when copying the scripts over.

 

GIS_analysis_esri.jpg

 

ArcGIS has robust support for Python, allowing many GIS workflows to be automated within the ArcMap software framework. ArcPy is a Python module that interfaces directly with ArcGIS, giving the user powerful scripting capabilities within the Esri software ecosystem. ArcMap also ships with ModelBuilder, a graphical environment for chaining geoprocessing tools together without touching the Python shell; models built there can be exported as Python scripts, which makes the relationship between Python and the ArcGIS toolkit easier to understand and implement for the visual thinker.

I've provided examples of my own work, written either in ModelBuilder or directly in the Python IDE, and I tried to keep the examples strictly geographic for this post. These scripts aren't guaranteed to work flawlessly or gracefully; this is a continued learning experience for me and any constructive criticism is welcome.

Here's an example of what I've found possible using Python and ArcPy.

 


# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# 50m.py
# Created on: 2017-03-28 15:18:18.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
Idaho_Moscow_students = "Idaho_Moscow_students"
Idaho_Moscow_busstops_shp = "H:\\Temp\\Idaho_Moscow_busstops.shp"
busstops_50m_shp = "H:\\Temp\\busstops_50m.shp"
within_50m_shp = "H:\\Temp\\within_50m.shp"

# Process: Buffer
arcpy.Buffer_analysis(Idaho_Moscow_busstops_shp, busstops_50m_shp, "50 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

# Process: Intersect
arcpy.Intersect_analysis("Idaho_Moscow_students #;H:\\Temp\\busstops_50m.shp #", within_50m_shp, "ALL", "", "INPUT")

 

The output is tidy and properly commented by default, saving the user the time it usually takes to make code tidy and functionally legible. It also includes a proper header and the locations of the map assets. All of this is generated on the fly, so quality code is produced every time, which is a great reason to use ModelBuilder over writing everything by hand in the IDE.

The script above takes a dataset containing spatial information about students and bus stops in Moscow, Idaho, applies a 50-meter buffer to the bus stops, and creates a shapefile of all the students who fall within that buffer. The newly created 50m layer can then be used as an input to further operations, and the buffer distance can be incremented in ModelBuilder to create shapefiles for different distances.

The benefit of this over manually creating the shapefile is the obscene amount of time saved. Depending on how thorough the GIS needs to be, each one of these points might need its own shapefile or aggregation of shapefiles. This script runs the necessary hundred or so geoprocessing operations to create the spatial assets in a fraction of the time it would take by hand.

The script below takes the same concept but changes the variables so the output is 100m instead of 50m. Segments of the code can be changed to augment the operation without starting from scratch. This makes it possible to automate the creation of these scripts, the ultimate goal.

 

# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# 100m.py
# Created on: 2017-03-28 15:19:04.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
Idaho_Moscow_students = "Idaho_Moscow_students"
Idaho_Moscow_busstops_shp = "H:\\Temp\\Idaho_Moscow_busstops.shp"
busstops_100m_shp = "H:\\Temp\\busstops_100m.shp"
within_100m_shp = "H:\\Temp\\within_100m.shp"

# Process: Buffer
arcpy.Buffer_analysis(Idaho_Moscow_busstops_shp, busstops_100m_shp, "100 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

# Process: Intersect
arcpy.Intersect_analysis("Idaho_Moscow_students #;H:\\Temp\\busstops_100m.shp #", within_100m_shp, "ALL", "", "INPUT")

 

This example, with a 100m buffer instead of a 50m buffer, can be produced either in ModelBuilder itself or by using the replace function in your favorite text editor. By changing one variable we have another properly formatted script, saving time that would otherwise be spent manually operating the tools in the ArcMap workspace. This can be developed further to take input from the user and run the tools directly through ArcPy, allowing for the possibility of "headless" GIS operations without designing anything manually; a sketch of that idea follows.
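Here is a minimal sketch of that idea: one loop that produces the same pair of outputs for several buffer distances. The paths mirror the exported scripts above, and the list of distances is arbitrary.

# Sketch: generalize the ModelBuilder exports into a loop over distances.
import arcpy

students = "Idaho_Moscow_students"
busstops = "H:\\Temp\\Idaho_Moscow_busstops.shp"

for distance in [50, 100, 200, 400]:
	buffer_shp = "H:\\Temp\\busstops_%dm.shp" % distance
	within_shp = "H:\\Temp\\within_%dm.shp" % distance

	# Buffer the bus stops, then intersect the students with the buffer.
	arcpy.Buffer_analysis(busstops, buffer_shp, "%d Meters" % distance, "FULL", "ROUND", "ALL", "", "PLANAR")
	arcpy.Intersect_analysis([students, buffer_shp], within_shp, "ALL", "", "INPUT")

Each pass creates one buffer shapefile and one intersect shapefile, exactly like the exported scripts, without any manual work in the ArcMap workspace.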

This functionality extends to database operations. In the following script, a shapefile is created from features selected by an attribute in a table.

 


# ---------------------------------------------------------------------------
# airport_buffer.py
# Created on: 2017-03-30 09:06:41.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy


# Local variables:
airports = "airports"
airports__4_ = airports
Airport_buffer_shp = "H:\\Ex7\\Exercise07\\Challenge\\Airport_buffer.shp"

# Process: Select Layer By Attribute
arcpy.SelectLayerByAttribute_management(airports, "NEW_SELECTION", "\"FEATURE\" = 'Airport'")

# Process: Buffer
arcpy.Buffer_analysis(airports__4_, Airport_buffer_shp, "15000 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

 

This script finds all features labeled "Airport" in a dataset and creates a 15km buffer around each one. By integrating SQL-style queries, data can easily be parsed and presented. All of this code can be generated using ModelBuilder in the ArcMap client. Efficient scripting comes from applying Python's logic efficiently with a clear understanding of the objective to be achieved.

 


import arcpy
arcpy.env.workspace = "H:/Exercise12"

def countstringfields():
	fields = arcpy.ListFields("H:/Exercise12/streets.shp", "", "String")
	namelist = []
	for field in fields:
		namelist.append(field.type)
	print(len(namelist))

countstringfields()

 

This script counts the number of "String" fields in a table. The function countstringfields starts by listing the fields of type "String" in a shapefile's attribute table. Next an empty list of names is defined, and a loop appends the type of each matching field to the list, essentially counting the string fields "by hand". The resulting count is then printed for the user, all without opening the ArcMap client. Proper use of indentation and whitespace is an important part of Python syntax, so special care should be taken with things like nested loops. The script can be developed further by reading the shapefile and field type from user input, as sketched below.
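A minimal sketch of that variation, written in the same Python 2 style as the other scripts; the prompts are my own invention:

import arcpy

# Count fields of a given type in a shapefile supplied by the user.
def count_fields(shapefile, field_type):
	return len(arcpy.ListFields(shapefile, "", field_type))

shp = raw_input("Path to shapefile: ")
ftype = raw_input("Field type (e.g. String, Integer): ")
print "%s has %d %s fields" % (shp, count_fields(shp, ftype), ftype)

Scripts can also be used to update datasets in addition to parsing them.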

 


import arcpy
from arcpy import env
env.workspace = "H:/Ex7/Exercise07/"
fc = "Results/airports.shp"
cursor = arcpy.da.UpdateCursor(fc, ["TOT_ENP"])
for row in cursor:
	if row[0] < 100000:
		cursor.deleteRow()
del row
del cursor

 

This script navigates a shapefile's attribute table with an update cursor. A loop runs through every row and deletes any airport serving fewer than 100,000 passengers. It's important to consider data integrity when altering geospatial databases; a variation using the cursor as a context manager is sketched below. This kind of functionality is convenient when working away from the ArcGIS client, sometimes described as a "headless" method.
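A minimal sketch of the same deletion using the cursor as a context manager, which guarantees the cursor and its lock on the shapefile are released even if an error occurs partway through:

import arcpy
from arcpy import env

env.workspace = "H:/Ex7/Exercise07/"
fc = "Results/airports.shp"

# The with-block releases the cursor automatically, even on an exception.
with arcpy.da.UpdateCursor(fc, ["TOT_ENP"]) as cursor:
	for row in cursor:
		if row[0] < 100000:
			cursor.deleteRow()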

 


import arcpy
mxd = r"H:\Exercise10\Austin_TX.mxd"
mapdoc = arcpy.mapping.MapDocument(mxd)

for df in arcpy.mapping.ListDataFrames(mapdoc):
	print "Data frame " + df.name + " contains the following layers:"
	lyrlist = arcpy.mapping.ListLayers(mapdoc, "", df)
	for lyr in lyrlist:
		print lyr.name
del mapdoc

 

The script above uses a nested loop (a loop within a loop) to print the data frames in an .mxd document along with the layers each one contains. This is useful for inspecting map documents from the command line without using the ArcMap client as an intermediary.

This is a simple program that tells the user whether a set of side lengths forms a valid triangle and, if so, whether it is equilateral, scalene, or isosceles. When working with spatial data that relies on some degree of triangulation, this kind of check verifies that the measurements are usable. This, I've noticed, is the large draw of programming: performing repetitive tasks accurately. Think about the most mind-numbing work you know; it might be math. Imagine being tasked with applying the triangle logic below to measurements, day after day. You would get good at it, but you would still make mistakes, which a properly programmed computer will not. Human error due to fatigue or inattention is not present in computer-automated operations. The flip side is GIGO, or Garbage In, Garbage Out: if the logic or the input is garbage, the output will be garbage too. The following script explores performing simple geometric calculations with Python, reporting both the validity and the type of each triangle.

 


## Josh Dean
## 2/9/17
## GEOG 4103
## HW4-3

##variables

count = 0
a=0
b=0
c=0
listA = [5, 6, 7]
listB = [7, 6, 5]
listC = [1, 9, 1]
listD = [9, 1, 9]

##functions

def triangle_type(a, b, c):
	if a == b == c:
		print("Type: Equilateral triangle")
	elif a != b and b != c and a != c:
		print("Type: Scalene triangle")
	else:
		print("Type: Isosceles triangle")

def triangle_validity(a, b, c):
	if (c > a + b) or (a > b + c) or (b > a + c):
		print("Valid: No")
	else:
		print("Valid: Yes")

##inputs

print("Feeding program lists of measurements...")

while count < 4:
	if count == 0:
		print listA
		a, b, c = listA
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	elif count == 1:
		print listB
		a, b, c = listB
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	elif count == 2:
		print listC
		a, b, c = listC
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)
	else:
		print listD
		a, b, c = listD
		count = count + 1
		triangle_type(a, b, c)
		triangle_validity(a, b, c)

 

The script above was fun to make.

It accepts input in the form of multiple lists. The lists are hard-coded in this case, but they could come from user input or be read from a text file. The while loop uses a counter to track how many times it has run, and each pass feeds one list to the two functions through a chain of conditionals.
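Here is a minimal sketch of the text-file variation, reusing the two functions defined above; the filename and its comma-separated layout are assumptions:

# Feed the triangle functions from a text file instead of hard-coded lists.
with open("C:\\Temp\\triangles.txt") as f:
	for line in f:
		line = line.strip()
		if not line:
			continue
		a, b, c = [float(x) for x in line.split(",")]
		triangle_type(a, b, c)
		triangle_validity(a, b, c)

The next script moves from geometry to an agricultural dataset read from a CSV file.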

 


import csv

yield1999 = []
yield2000 = []
yield2001 = []

f = open(r'C:\Users\jdean32\Downloads\yield_over_the_years.csv')
csv_f = csv.reader(f)
next(csv_f)
for row in csv_f:
	yield1999.append(row[0])
	yield2000.append(row[1])
	yield2001.append(row[2])

yield1999 = map(float, yield1999)
yield2000 = map(float, yield2000)
yield2001 = map(float, yield2001)

f.close()

print("1999: %s") %(yield1999)
print("2000: %s") %(yield2000)
print("2001: %s") %(yield2001)

year1999 = 1999
max_value_1999 = max(yield1999)
min_value_1999 = min(yield1999)
avg_value_1999 = sum(yield1999)/len(yield1999)
print("\nIn %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield.") %(year1999, max_value_1999, min_value_1999, avg_value_1999)

year2000 = 2000
max_value_2000 = max(yield2000)
min_value_2000 = min(yield2000)
avg_value_2000 = sum(yield2000)/len(yield2000)
print("In %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield.") %(year2000, max_value_2000, min_value_2000, avg_value_2000)

year2001 = 2001
max_value_2001 = max(yield2001)
min_value_2001 = min(yield2001)
avg_value_2001 = sum(yield2001)/len(yield2001)
print("In %s, %s was the maximum yield, %s was the minimum yield and %s was the average yield.") %(year2001, max_value_2001, min_value_2001, avg_value_2001)

 

Like I said before, this was fun to make. Always eager to take the road less traveled, I thought of the most obtuse way to make this calculation. The objective of the above script was to read text from a file and compare three years of agriculture data; the script finds the maximum, minimum, and average yield for each year. This is all accomplished with a quick for loop, relying on several sets of variables to make sure the final answers are correct. The program can ingest different input, so changing the text file, or the location where it is looked for, will produce different results from the same script. Different data can be automatically run through this particular operation.
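To run the script end to end, the input file can be recreated. Here is a sketch that writes a sample yield_over_the_years.csv; the header names are my own assumption, and the yield figures are the same ones the next script hard-codes.

# Write a small sample input file matching what the script expects:
# a header row followed by three columns of yields.
sample = """1999,2000,2001
3.34,4.07,4.21
21.8,4.51,4.29
1.34,3.9,4.64
3.75,3.63,4.27
4.81,3.15,3.55
"""

with open(r'C:\Users\jdean32\Downloads\yield_over_the_years.csv', "w") as f:
	f.write(sample)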

 


yield1999 = [3.34, 21.8, 1.34, 3.75, 4.81]
yield2000 = [4.07, 4.51, 3.9, 3.63, 3.15]
yield2001 = [4.21, 4.29, 4.64, 4.27, 3.55]

location1 = (yield1999[0] + yield2000[0] + yield2001[0])/3
location2 = (yield1999[1] + yield2000[1] + yield2001[1])/3
location3 = (yield1999[2] + yield2000[2] + yield2001[2])/3
location4 = (yield1999[3] + yield2000[3] + yield2001[3])/3
location5 = (yield1999[4] + yield2000[4] + yield2001[4])/3

locations = [location1, location2, location3, location4, location5]
count = 1

text_fileA = open("C:\Temp\OutputA.txt", "w")

for i in locations:
	text_fileA.write(("The average yield at location %s between 1999 and 2001 %.2f\n") %(count, i))
	count = count + 1

text_fileA.close()

max1999 = (yield1999.index(max(yield1999))+1)
max2000 = (yield2000.index(max(yield2000))+1)
max2001 = (yield2001.index(max(yield2001))+1)
min1999 = (yield1999.index(min(yield1999))+1)
min2000 = (yield2000.index(min(yield2000))+1)
min2001 = (yield2001.index(min(yield2001))+1)

minmax1999 = [1999, max1999, min1999]
minmax2000 = [2000, max2000, min2000]
minmax2001 = [2001, max2001, min2001]

minmax = [minmax1999, minmax2000, minmax2001]

text_fileB = open("C:\Temp\OutputB.txt", "w")

for i in minmax:
	text_fileB.write(("In %s we yielded the least at location %s and the most at location %s.\n") %(i[0], i[2], i[1]))

text_fileB.close()

 

Another attempt at the agriculture problem. Versioning is something I find useful not only for keeping a record of changes but also for keeping track of progress. This was the fourth version of this script and I think it turned out rather unorthodox, which touches on something I find most interesting about coding: there are multiple approaches to the same objective. The two scripts above are similar but approached in different ways. This script uses a for loop to run through a contextually sensitive number of inputs. The values were hardcoded into the program as variables at the start of the script, but they could be read from a file if necessary.

The following script looks for the basin layer in an ArcMap file and clips the soils layer using the basin layer. This produces an area where both the soil layer and the basin layer are present. From this clipped soil layer, the script goes on to select the features whose attribute value is "Not prime farmland". This is useful for property development, where the amount of available farmland is a consideration.

 

 


import arcpy

print "Starting"

soils = "H:\\Final_task1\\soils.shp"
basin = "H:\\Final_task1\\basin.shp"
basin_Clip = "C:\\Users\\jdean32\\Documents\\ArcGIS\\Default.gdb\\basin_Clip"
task1_result_shp = "H:\\task1_result.shp"

arcpy.Clip_analysis(soils, basin, basin_Clip, "")

arcpy.Select_analysis(basin_Clip, task1_result_shp, "FARMLNDCL = 'Not prime farmland'")

print "Completed"

 

The next script clips all feature classes from a folder called “USA” according to the Iowa state boundary. It then places them in a new folder. This is useful if you have country-wide data but only want to present the data from a particular area, in this case Iowa.

The script will automatically read all the shapefiles in the USA folder, no matter how many there are.

 

 


import arcpy

sourceUSA = "H:\\Final_task2\\USA"
sourceIowa = "H:\\Final_task2\\Iowa"
iowaBoundary = "H:\\Final_task2\\Iowa\\IowaBoundary.shp"

arcpy.env.workspace = sourceUSA
fcList = arcpy.ListFeatureClasses()

print "Starting"

for features in fcList:
	outputfc = sourceIowa + "\\Iowa" + features
	arcpy.Clip_analysis(features, iowaBoundary, outputfc)

print "Completed"

 

The following script finds the average population for a set of counties in a dataset. By dividing the total population by the number of counties, the average population is found. This is useful for calculating summary values in large datasets without doing it by hand.

 

 


import arcpy

featureClass = "H:\\Final_task3\\Counties.shp"

row1 = arcpy.SearchCursor(featureClass)
row2 = row1.next()

avg = 0
totalPop = 0
totalRecords = 0

while row2:
	totalPop += row2.POP1990
	totalRecords += 1
	row2 = row1.next()

avg = totalPop / totalRecords
print "The average population of the " + str(totalRecords) + " counties is: " + str(avg)

 

The following is a modified script that calculates the driving distance between two locations. Originally the script calculated the distance between UNCC and uptown; it has been edited to work from user input. The API is finicky, so the variables have to be formatted exactly right to request the correct data. User input is reconciled by replacing spaces with plus signs so the request URL is valid.

 


## Script Title: Printing data from a URL (webpage)
## Author(s): CoDo
## Date: December 2, 2015

# Import the urllib2 and json libraries
import urllib2
import json
import re

originaddress = raw_input("What is the address?\n")
originstate = raw_input("What is the state?\n")
originzip = raw_input("What is the zipcode\n")
destinationaddress = raw_input("What is the destination address?\n")
destinationstate = raw_input("What is the state?\n")
destinationzip = raw_input("What is the destination zipcode\n")

print originaddress
print originstate
print originzip
print destinationaddress
print destinationstate
print destinationzip

originaddress = originaddress.replace(" ", "+")
destinationaddress = destinationaddress.replace(" ", "+")

# Google API key (get it at https://code.google.com/apis/console)

google_APIkey = ""  ## removed for security

# Read the response url of our request to get directions from UNCC to the Time Warner Cable Arena
url_address = 'https://maps.googleapis.com/maps/api/directions/json?origin=%s,%s+%s&destination=%s,%s+%s&key='% (originaddress, originstate, originzip, destinationaddress, destinationstate, destinationzip) + google_APIkey
##url_address = 'https://maps.googleapis.com/maps/api/directions/json?origin=1096+Meadowbrook+Ln+SW,NC+28027&destination=9201+University+City+Blvd,NC+28223
url_sourceCode = urllib2.urlopen(url_address).read()

# Convert the url's source code from a string to a json format (i.e. dictionary type)
directions_info = json.loads(url_sourceCode)

# Extract information from the dictionary holding the information about the directions
origin_name = directions_info['routes'][0]['legs'][0]['start_address']
origin_latlng = directions_info['routes'][0]['legs'][0]['start_location'].values()
destination_name = directions_info['routes'][0]['legs'][0]['end_address']
destination_latlng = directions_info['routes'][0]['legs'][0]['end_location'].values()
distance = directions_info['routes'][0]['legs'][0]['distance']['text']
traveltime = directions_info['routes'][0]['legs'][0]['duration']['value'] / 60

# Print a phrase that summarizes the trip
print "Origin: %s %s \nDestination: %s %s \nEstimated travel time: %s minutes" % (origin_name, origin_latlng, destination_name, destination_latlng, traveltime)

 

This next script looks for feature classes in a workspace and prints the name and geometry type of each one. This would be useful for parsing datasets and looking for specific geometry types, like polygons.

 


import arcpy
from arcpy import env
env.workspace = "H:/arcpy_ex6/Exercise06"
fclist = arcpy.ListFeatureClasses()
for fc in fclist:
	fcdescribe = arcpy.Describe(fc)
	print (fcdescribe.basename + " is a " + str.lower(str(fcdescribe.shapeType)) + " feature class")

 

The following script adds a text field to the roads attribute table. The new field is called Ferry and is populated with either "Yes" or "No", depending on the value of the FEATURE field.

This is useful for quickly altering data in an attribute field or dataset without directly interfacing with the ArcMap client.

 


##libraries
import arcpy
from arcpy import env
env.workspace = "C:/Users/jdean32/Downloads/Ex7/Exercise07"

##variables
fclass = "roads.shp"
nfield = "Ferry"
ftype = "TEXT"
fname = arcpy.ValidateFieldName(nfield)
flist = arcpy.ListFields(fclass)

if fname not in [f.name for f in flist]:
	arcpy.AddField_management(fclass, fname, ftype, "", "", 12)
	print "Ferry attribute added."

cursor = arcpy.da.UpdateCursor(fclass, ["FEATURE","FERRY"])

for row in cursor:
	if row[0] == "Ferry Crossing":
		row[1] = "Yes"
	else:
		row[1] = "No"
	cursor.updateRow(row)
del cursor

 

The following script uses the same functionality as the airport buffer script near the beginning of this article. That script created a 15,000-meter buffer around airport features in a shapefile; this one creates a 7,500-meter buffer around airports that operate seaplanes, which requires selecting the seaplane bases from the attribute table. The end result is two separate buffers. A picture says a thousand words; by having two buffers we multiply the amount of information a single cartographic visualization can convey.

 


# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# airport_buffer.py
# Created on: 2017-03-30 09:06:41.00000
# (generated by ArcGIS/ModelBuilder)
# Description:
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy

# Local variables:
airports = "airports"
airports__4_ = airports
Airport_buffer_shp = "H:\\Ex7\\Exercise07\\Challenge\\Seaplane_base_buffer.shp"

# Process: Select Layer By Attribute
arcpy.SelectLayerByAttribute_management(airports, "NEW_SELECTION", "\"FEATURE\" = 'Seaplane Base'")

# Process: Buffer
arcpy.Buffer_analysis(airports__4_, Airport_buffer_shp, "7500 Meters", "FULL", "ROUND", "ALL", "", "PLANAR")

 

Finally, we have a script that looks through a workspace, reads the feature classes, copies each one into a geodatabase, and copies the polygon feature classes into a second geodatabase. Once again, this makes it easy to parse and migrate data between datasets.

 


import arcpy, os
arcpy.env.workspace = r'H:\arcpy_ex6\Exercise06'
fclass = arcpy.ListFeatureClasses()

outputA = r'H:\arcpy_ex6\Exercise06\testA.gdb'
outputB = r'H:\arcpy_ex6\Exercise06\testB.gdb'

for fc in fclass:
	fcdesc = arcpy.Describe(fc).shapeType
	outputC = os.path.join(outputA, fc)
	arcpy.CopyFeatures_management(fc, outputC)
	if fcdesc == 'Polygon':
		outputC = os.path.join(outputB, fc)
		arcpy.CopyFeatures_management(fc, outputC)

 

Python is a blessing for geographers who want to automate their work. Its logical but strict syntax produces easily legible code. Its integration with the ArcGIS suite and its fairly simple syntax make it easy to pick up for experts and beginners alike. ModelBuilder abstracts the programming process and makes it easier for people who are more comfortable with GUI interfaces. There is little a geographer with a strong knowledge of Python and mathematics can't do.
