Monday, August 27, 2012


Analyzing air traffic performance with Vectorwise


Credit for the idea for this blog goes to:

So I read the above blog and I decided to put Actian Vectorwise to the test. Vectorwise is the record holder of the fastest TPC-H benchmark results. I went on what I call RITA (US Department of Transportation Research and Innovative Technology Administration) website. There you will find a plethora of data that they collect but I wanted the most data available (number of rows) - the OnTime table has that.

From their website, here is what the OnTime data covers:
Airline on-time data are reported each month to the U.S. Department of Transportation (DOT), Bureau of Transportation Statistics (BTS) by the 16 U.S. air carriers that have at least 1 percent of total domestic scheduled-service passenger revenues, plus two other carriers that report voluntarily. The data cover nonstop scheduled-service flights between points within the United States (including territories) as described in 14 CFR Part 234 of DOT's regulations. Data are available since January 1995. The following statistics are available:

Summary Statistics - All and late flights (total number, average departure delay, average taxi-out and average scheduled departure) and late flights (total and percent of diverted and cancelled flights).

Origin Airport 
Destination Airport 
Origin and Destination Airport 
Airline 
Flight Number 

Detailed Statistics - Departure and arrival statistics (scheduled departure time, actual departure time, scheduled elapse time, departure delay, wheels-off time and taxi-out time) by airport and airline; airborne time, cancellation and diversion by airport and airline. 

Departures 
Arrivals 
Airborne Time 
Cancellation 
Diversion 

In other words, Summary statistics performs aggregation on the data and shows you a small table with the results, while Detailed Statistics actually fetches the rows from the database.
As of Sept. 2012, the number of rows in the single ontime table is ~145 million.

I decided to run the queries mentioned in the blog above, but I also wanted to see how my experience and the experience of other users would be if the Department of Transportation used Vectorwise as the underlying database powering their website.
The Airline Summary Statistics looked like a good start.

To see the performance of their website (I have no idea what kind of machine(s) they are using, but I have an idea of the underlying database - it's from one of the largest vendors in the world) you can visit:
http://www.bts.gov/xml/ontimesummarystatistics/src/ddisp/OntimeSummarySelect.xml?tname=OntimeSummaryAirlineData

In their page, there is a silly note:
NOTE: Due to the large amount of data to be searched, time period should be limited to one year.

Right there you can see that they have a Big Data problem and that their database is not suited to the task of analyzing large amounts of data. In all reality 145 million rows is trivial for Vectorwise but their row oriented database simply can't handle it.
As a side note, my personal home machine  is a custom built, Intel i7-2600K CPU @3.4GHz with 16GB of RAM.

So go ahead, visit their website, select American Airlines from Jan 1st 2011 - Jan 1st 2012 and hit submit. You get to see how painfully long it takes to run the queries. Do not select more than 1 year as it will either hang or error out. Imagine I had to do real analysis work with this interface, i.e. let's say I wanted to see how many flights Southwest Airlines had from Jan 1st 1995 - Jan 1st 2012. I would have to select Southwest from the drop down, once for every year, 17 times * 3 minutes average time, means I would have to wait almost 1 hour  to find out the answer. Simply unacceptable.
To their defense, I'm sure they are working with limited budget and this database has been in place for quite sometime but it showcases the need for an upgrade of their outdated software.


Downloading the data
Data is actually available from October 1987 until today (they are a few months behind in entering the latest data). Their drop-down in their website however only goes back to 1995. Data comes in a .zip file for every month, so I downloaded 296 files. Here is the direct link to download the files:
http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_1988_1.zip

Just  replace 1988 with your preferred year and 1 with your preferred month, so July of 2011 would be:
http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_2011_7.zip

The newer .zip files are about 28MB in size, and they unzip to about 230MB. So uncompressed the size of all the data is approximately 59GB.

Creating the table
To load the data in Vectorwise, we first have to create the ontime table. Here is my create table script:

create table ontime(
        year integer default NULL,
        quarter i1 default NULL,
        month i1 default NULL,
        dayofmonth i1 default NULL,
        dayofweek i1 default NULL,
        flightdate ansidate default NULL,
        uniquecarrier char(7) default NULL collate ucs_basic,
        airlineid integer default NULL,
        carrier char(2) default NULL collate ucs_basic,
        tailnum varchar(50) default NULL collate ucs_basic,
        flightnum varchar(10) default NULL collate ucs_basic,
        originairportid integer default NULL,
        originairportseqid integer default NULL,
        origincitymarketid integer default NULL,
        origin char(5) default NULL collate ucs_basic,
        origincityname varchar(100) default NULL collate ucs_basic,
        originstate char(2) default NULL collate ucs_basic,
        originstatefips varchar(10) default NULL collate ucs_basic,
        originstatename varchar(100) default NULL collate ucs_basic,
        originwac integer default NULL,
        destairportid integer default NULL,
        destairportseqid integer default NULL,
        destcitymarketid integer default NULL,
        dest char(5) default NULL collate ucs_basic,
        destcityname varchar(100) default NULL collate ucs_basic,
        deststate char(2) default NULL collate ucs_basic,
        deststatefips varchar(10) default NULL collate ucs_basic,
        deststatename varchar(100) default NULL collate ucs_basic,
        destwac integer default NULL,
        crsdeptime integer default NULL,
        deptime integer default NULL,
        depdelay integer default NULL,
        depdelayminutes integer default NULL,
        depdel15 integer default NULL,
        departuredelaygroups integer default NULL,
        deptimeblk varchar(20) default NULL collate ucs_basic,
        taxiout integer default NULL,
        wheelsoff varchar(10) default NULL collate ucs_basic,
        wheelson varchar(10) default NULL collate ucs_basic,
        taxiin integer default NULL,
        crsarrtime integer default NULL,
        arrtime integer default NULL,
        arrdelay integer default NULL,
        arrdelayminutes integer default NULL,
        arrdel15 integer default NULL,
        arrivaldelaygroups integer default NULL,
        arrtimeblk varchar(20) default NULL collate ucs_basic,
        cancelled i1 default NULL,
        cancellationcode char(1) default NULL collate ucs_basic,
        diverted i1 default NULL,
        crselapsedtime integer default NULL,
        actualelapsedtime integer default NULL,
        airtime integer default NULL,
        flights integer default NULL,
        distance integer default NULL,
        distancegroup i1 default NULL,
        carrierdelay integer default NULL,
        weatherdelay integer default NULL,
        nasdelay integer default NULL,
        securitydelay integer default NULL,
        lateaircraftdelay integer default NULL,
        firstdeptime varchar(10) default NULL collate ucs_basic,
        totaladdgtime varchar(10) default NULL collate ucs_basic,
        longestaddgtime varchar(10) default NULL collate ucs_basic,
        divairportlandings varchar(10) default NULL collate ucs_basic,
        divreacheddest varchar(10) default NULL collate ucs_basic,
        divactualelapsedtime varchar(10) default NULL collate ucs_basic,
        divarrdelay varchar(10) default NULL collate ucs_basic,
        divdistance varchar(10) default NULL collate ucs_basic,
        div1airport varchar(10) default NULL collate ucs_basic,
        div1airportid integer default NULL,
        div1airportseqid integer default NULL,
        div1wheelson varchar(10) default NULL collate ucs_basic,
        div1totalgtime varchar(10) default NULL collate ucs_basic,
        div1longestgtime varchar(10) default NULL collate ucs_basic,
        div1wheelsoff varchar(10) default NULL collate ucs_basic,
        div1tailnum varchar(10) default NULL collate ucs_basic,
        div2airport varchar(10) default NULL collate ucs_basic,
        div2airportid integer default NULL,
        div2airportseqid integer default NULL,
        div2wheelson varchar(10) default NULL collate ucs_basic,
        div2totalgtime varchar(10) default NULL collate ucs_basic,
        div2longestgtime varchar(10) default NULL collate ucs_basic,
        div2wheelsoff varchar(10) default NULL collate ucs_basic,
        div2tailnum varchar(10) default NULL collate ucs_basic,
        div3airport varchar(10) default NULL collate ucs_basic,
        div3airportid integer default NULL,
        div3airportseqid integer default NULL,
        div3wheelson varchar(10) default NULL collate ucs_basic,
        div3totalgtime varchar(10) default NULL collate ucs_basic,
        div3longestgtime varchar(10) default NULL collate ucs_basic,
        div3wheelsoff varchar(10) default NULL collate ucs_basic,
        div3tailnum varchar(10) default NULL collate ucs_basic,
        div4airport varchar(10) default NULL collate ucs_basic,
        div4airportid integer default NULL,
        div4airportseqid integer default NULL,
        div4wheelson varchar(10) default NULL collate ucs_basic,
        div4totalgtime varchar(10) default NULL collate ucs_basic,
        div4longestgtime varchar(10) default NULL collate ucs_basic,
        div4wheelsoff varchar(10) default NULL collate ucs_basic,
        div4tailnum varchar(10) default NULL collate ucs_basic,
        div5airport varchar(10) default NULL collate ucs_basic,
        div5airportid integer default NULL,
        div5airportseqid integer default NULL,
        div5wheelson varchar(10) default NULL collate ucs_basic,
        div5totalgtime varchar(10) default NULL collate ucs_basic,
        div5longestgtime varchar(10) default NULL collate ucs_basic,
        div5wheelsoff varchar(10) default NULL collate ucs_basic,
        div5tailnum varchar(10) default NULL collate ucs_basic
)
Loading the data
After you create the table you need to load the data. Vectorwise has a fast loader, that can load anywhere from 80,000 to 500,000 rows per second, depending on the speed of your hardrive(s).

The command to load the data on the Windows version of Vectorwise is:
vwload -H -f ,  -q """" -I -t ontime ontime On_Time_On_Time_Performance_2011_11.csv

If you wanted to try this on Linux, you can say:
vwload -H -f , -q '"' -I -t ontime ontime On_Time_On_Time_Performance_*.csv

vwload is the name of the fastloader, -H means ignore the first line which is the header, -f , means use comma as the delimiter, -q means that is the quote, -t ontime means the table is called ontime, the next ontime is the name of the database and the last argument is the .csv file that you want to load.

Loading times on my machine are about 5-6 seconds for each .csv file (I have a 15TB NAS drive with 7200rpm hard disks, not the fastest setup in the world), so it would take me about 28 minutes to load the whole table - still an amazing time.

Data size after load is ~16.6GB, which means Vectorwise compresses this data set by a factor of 3.5.

Running queries
We can now start running queries (yes, with Vectorwise you don't have to pre-aggregate data, create cubes or indexes, just load your data and you're done) - let's see how fast Vectorwise is: (hot times)

Q0: SELECT count(*) FROM ontime - 0.112 seconds
Q1: SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC - 0.241 seconds
Q2: SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC - 0.283 secs
Q3: SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008 GROUP BY Origin ORDER BY c DESC fetch first 10 rows only - 0.347 seconds
Q4:  SELECT carrier, count(*) FROM ontime WHERE DepDelay>10  AND Year=2007 GROUP BY carrier ORDER BY 2 DESC -0.126 seconds
Q5: WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10  AND Year=2007 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC - 0.257 seconds
Q6: Same as above but from 2000 - 2008:WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10  AND Year between 2000 and 2008 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year between 2000 and 2008 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC - 0.976 seconds
Q7with t as (select YEAR,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year), t2 as (select YEAR,count(*) as c2 from ontime GROUP BY YEAR) select t.YEAR, c1/c2 FROM t JOIN t2 ON (t.YEAR=t2.YEAR) - 0.826 seconds
Q8:  SELECT DestCityName, COUNT( DISTINCT OriginCityName) FROM ontime WHERE Year BETWEEN 1999 and 2009 GROUP BY DestCityName ORDER BY 2 DESC fetch first 10 rows only;
1 year: 0.433 seconds
2 years: 0.646 seconds
10 years: 2.776 (finally a query that took more than 1 second)
Q9select year,count(*) as c1 from ontime group by YEAR - 0.405 seconds

You can see that most queries return in sub-second times on my $700 Home PC. I actually sometimes demo this application running on my Dell E6410 laptop, that has 4GB of RAM and a 5400rpm hard drive.
You have to see people's amazement when my queries return in seconds, while the gov. website is still thinking...
So you can see Vectorwise achieves these tremendous results on commodity hardware, no need to have clusters of expensive machines.
A new RITA!
I managed to reverse engineer their queries and create my own RITA website. I simply use php going against my Vectorwise installation on my home PC. You can see it in action here:

http://bit.ly/PcLXyF
(if you can't get to it, sorry - it's my own personal machine that could be down from time to time).

Feel free to select for e.g. American Airlines, and ANY start year you want, the press Submit.
The longest queries take 2.5 seconds - select Southwest Airlines from 1995 to today, as SW has the most number of flights out of any airline.

Contrast that with the minutes it takes for the actual website to run 1 year's worth of analysis and it's inability to analyze more than one year of data and you start seeing the amazing power of Vectorwise.


Disruptive Technology
Imagine your end users getting answers to their data instantly instead of having to wait minutes or hours for the answer.
Imagine your favorite BI tool giving you answers in real time, on your computer screen or on the iPad that is in your hands (store managers can find out inventory or sales in real time, instead of having to wait for some nightly batch job to run and get the answer the next day, essentially stale data).
Imagine being able to load deltas in seconds WHILE your users are running queries - no downtime necessary to load the data.

You can stop imagining, Vectorwise is here and available today.

Disclaimer: I work for Actian and it's sometimes hard to curb my enthusiasm about Vectorwise.


More to come: I plan on benchmarking other databases and will write on various subjects about Vectorwise. Feel free to comment below.