The product of a preppy suburban Boston upbringing and a fratastic, hipster-influenced Midwestern university experience, he is currently a Reporter at Quartz, writing, creating interactive infographics, data visualizations, and news toys. Send him an email at email@example.com or a tweet @YAN0.
A friend of mine was at a party last night. The hosts pulled her aside and took her into the bathroom.
“For the last month we’ve looked out this window in our shower and saw this big ‘No’ taped on that wall over there…we couldn’t figure out what it was or why it would be there. We were obsessed. We would try to figure it out all the time. But now it’s gone.”
Today we put hastheusgoneoffthefiscalcliff.com to sleep after a month of service, so we wanted to explain how it came to be.
How did the site work?
We had a DSLR plugged into an AC power supply, on a tripod, hooked up to a Mac Mini with a USB cable:
- The Mac Mini ran a bash script every 5 minutes through the crontab
- the script triggered a camera capture through the USB cable and downloaded the image
- the script created two smaller resized copies of the image (one for the site, one for social media use)
- the script uploaded those images to our web server, replacing the previous captures
- the script put a timestamp in the full sized image’s filename and moved it to an archive on the Mac Mini (for posterity)
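The steps above can be sketched in Python (the original was a bash script; the gphoto2 command, the ImageMagick resizes, and all of the paths and filenames here are illustrative assumptions, not the actual setup):

```python
import os
import subprocess
from datetime import datetime

CAPTURE = "imagecapture.jpg"  # hypothetical local filename for the fresh capture

def archive_name(ts):
    # timestamped filename for the full-size archive copy
    return "archive/imagecapture_%s.jpg" % ts.strftime("%Y%m%d-%H%M%S")

def capture_and_publish():
    # trigger the camera over USB and download the frame
    # (gphoto2 is one common CLI for tethered DSLR capture)
    subprocess.call(["gphoto2", "--capture-image-and-download",
                     "--filename", CAPTURE, "--force-overwrite"])
    # make the two resized copies: one for the site, one for social media
    subprocess.call(["convert", CAPTURE, "-resize", "1000x", "imagecapture_1000.jpg"])
    subprocess.call(["convert", CAPTURE, "-resize", "600x", "imagecapture_social.jpg"])
    # upload the small copies to the web server, replacing the previous captures
    subprocess.call(["scp", "imagecapture_1000.jpg", "imagecapture_social.jpg",
                     "user@webserver:/var/www/site/"])
    # move the full-size original into a local archive for posterity
    os.rename(CAPTURE, archive_name(datetime.now()))
```

A cron entry running this every five minutes would complete the loop.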
"http://hastheusgoneoffthefiscalcliff.com/imagecapture_1000.jpg?timestamp=" + (new Date()).getTime()
This redefined the path to the image every 2.5 minutes, appending the current time as a query parameter to make sure the browser didn’t serve a cached version of the image.
The clickable areas were defined in a Google spreadsheet that was loaded every time the page loaded and on each subsequent image replacement. We updated the spreadsheet by hand every time we changed the wall.
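In outline, turning spreadsheet rows into clickable regions could work like this; a minimal sketch assuming rectangle columns and HTML image-map markup, neither of which is confirmed by the original site:

```python
import csv
import io

# each hypothetical spreadsheet row: left, top, width, height (in image pixels) and a link
SHEET_CSV = """x,y,w,h,url
120,80,300,140,http://example.com/some-headline
500,60,220,220,http://example.com/some-tweet
"""

def rows_to_areas(csv_text):
    # build <area> tags for an HTML image map from the spreadsheet rows
    areas = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        x, y = int(row["x"]), int(row["y"])
        w, h = int(row["w"]), int(row["h"])
        areas.append('<area shape="rect" coords="%d,%d,%d,%d" href="%s">'
                     % (x, y, x + w, y + h, row["url"]))
    return areas

areas = rows_to_areas(SHEET_CSV)
```

Refetching the sheet on each image swap keeps the hotspots in sync with the wall.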
How did the wall work?
The hard way: Every morning we got up, printed out some headlines, tweets, quotes and pictures, tiled them together and taped them to the wall.
The idea for the single-serving site was Zach’s from the beginning. Sometime in October he noticed that the hastheusgoneoffthefiscalcliff.com domain was available to register, and he brought it up as something maybe worth pursuing. The first idea he sketched out was this:
A man who slowly inches toward the edge of a cliff, paired with links to stories around the web about the topic and an explanation of what the fiscal cliff was. That conversation devolved into a debate over the merits of different depictions of cliffs:
After seeing Brian Rea’s coverage of the US Presidential Election Night on Instagram and remembering the website for Sagmeister and Walsh, I thought about making a webcam of a wall of stuff. I made this as a proof of concept:
Everyone got on board, and this is what resulted over the next month:
- David Yanofsky
Development time: 2 days
Imagine the human being who took the time to make this. We must deeply honor the focus of that person.
Apparently it seemed like a crazy task to filter through almost 100 years of documents and tabulate information about them. Let’s assume they thought I was doing this by hand.
I wasn’t. Not even the GIF. Here’s how:
Making the GIF
There’s a command line tool called ImageMagick that will both turn a PDF into a series of images and turn that series of images into a GIF. These are the two ImageMagick commands I used to accomplish this:
$ for infile in *.pdf; do convert -density 400 -resize 400 -trim -gravity north -extent 500x700 $infile jpeg/$infile.jpg; done
$ convert -delay 25 -loop 0 jpeg/f1040__*-0.jpg animated1040.gif
The first line tells ImageMagick to look at every PDF in the current directory, convert it to a 400px-wide JPEG (rasterizing the PDF’s vector data at 400ppi), trim the surrounding whitespace, extend the edges of the image to 500px by 700px (anchoring the image to the top center of the new bounds), and save it in the folder named jpeg. The second line tells ImageMagick to merge every .jpg file in the jpeg folder with a file name ending in “-0.jpg” (i.e., the first page of each former PDF) into a GIF called “animated1040.gif” that shows each image for 25 hundredths of a second and loops continuously.
After cleaning and optimizing it in Photoshop I had this.
Finding the files
All of this was dependent on having all of these 1040s. When I started looking for them, I was hoping some think tank or library would have an archive of the documents.
I decided to start simple. The current form is easy to find: a web search for “1040” returned a PDF served by the IRS as the top result. Now what about the old forms? A web search for “2010 Form 1040” also returned a PDF on the IRS website, but with a slightly different URL: www.irs.gov/pub/irs-prior/f1040--2010.pdf. “irs-prior”: I liked the look of that. And “f1040--2010.pdf”: could all of the filenames be systematized?
Yes! A couple minutes of URL manipulation in my browser revealed that there were files following this pattern dating back to 1913 (though there were no forms for 1914 and 1915, since those years used the same 1913 form).
Downloading the docs
The next step was to download all of the files. Should I change the year in each URL and save each one from my browser? TERRIBLE IDEA. I opened up my command line and used Python’s interactive prompt to download all the files super quick. It went something like this:
$ python
>>> import urllib
>>> years = range(1916,2012)
>>> for y in years:
...     urllib.urlretrieve("http://www.irs.gov/pub/irs-prior/f1040--%s.pdf" % (y,), "f1040--%s.pdf" % (y,))
...
Here’s what that means:
- start python
- load the library I need to download files
- create a list of years that I want to download: start in 1916, end in 2011 (range stops one short of 2012), and call it “years”
- cycle through every year in that list calling the current year “y”
- download the file using the URL and naming system I figured out before, and save the file using the same system
Two minutes later there were 97 PDFs in my folder for this project. I opened up the 1913 form in my browser and downloaded it. BOOM. Every 1040 ever.
So now we had all these files, and we had to quantify exactly how much more complex they got over time. My first idea was to use the amount of ink used on each document as a proxy for complexity: I wanted to count the number of black pixels in each document. I used ImageMagick to convert all the PDFs to images so I could start counting pixels.
Using a Python library called PIL, I opened up each file with Python, converted it to grayscale, counted the number of black pixels, calculated the ratio of black-to-total pixels, associated that JPEG with the appropriate year, and saved that information as a JSON blob and a CSV spreadsheet. I’ll spare you that code here, but you can see it here on GitHub.
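The core of that calculation can be sketched in pure Python. In the real script, PIL’s image.convert("L").getdata() would supply the grayscale values; the black threshold used here is an illustrative assumption:

```python
def black_ratio(gray_values, threshold=128):
    # gray_values: flat sequence of 0-255 grayscale pixel values
    # (the kind of sequence image.convert("L").getdata() yields)
    black = sum(1 for v in gray_values if v < threshold)
    return black / float(len(gray_values))

# a tiny 2x2 "image": two black pixels, two white
ratio = black_ratio([0, 0, 255, 255])
```

Doing this per page, rather than per document, gives the black-per-page ratio mentioned below.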
Using the CSV I got out of that, I made this chart showing the amount of “ink” on the form over time:
It was antithetical to what we knew was true. If the amount of printing were a proxy for complexity, this chart would suggest that today’s tax code is less complex than it was in the first years of the system.
Were the older documents just bigger? Did they use larger type? I charted the same information as a ratio of black per page. Same story. Then I realized that the older documents have more instructions on them! (More recent 1040s put the instructions in a separate appendix.) What if we just looked at the tabulation page? No luck. Apparently today’s documents are more ink-efficient than those of yesteryear.
I crafted a new strategy: count the number of line items on the form. (Our methodology for what we counted is recounted in the piece.) The slow way to do this would be to open each file in a document viewer, count how many lines were on each page, and type that into a spreadsheet, hoping I didn’t miss anything or make a typo. (It was beer o’clock in the office.)
The fast way is to write more code. I created another Python script that would open up each page of every document individually and prompt me to enter how many lines were on that page and whether I should overwrite the current number of lines I recorded or add these to the number of lines already recorded. Once complete, the script saved a spreadsheet of the recorded information.
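The bookkeeping behind that prompt is simple; a minimal sketch (the overwrite-or-add behavior mirrors the description above, but the function and field names are mine, not the original script’s):

```python
def record_lines(counts, year, n, overwrite=True):
    # counts: dict mapping year -> running line-item total for that form
    if overwrite or year not in counts:
        counts[year] = n          # replace the recorded count
    else:
        counts[year] += n         # add this page's lines to the running total
    return counts

counts = {}
record_lines(counts, 1913, 20)                    # first page: set the count
record_lines(counts, 1913, 11, overwrite=False)   # later page: add to it
```

In the real script, each call would follow an interactive prompt displaying a page, and the totals would be written out as a spreadsheet at the end.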
I made this chart.
Thinking about the transfer of instructions from the form to a separate document, I decided to take a look at the instructions booklet and see how those have changed over time. I used the same exact scripts as above to download all of the instruction files by changing the URL slightly to “i1040” from “f1040.” (This naming convention was also revealed by a web search for “1992 form 1040 instructions.”)
The recent documents were long: 2011’s is nearly 200 pages. I used some more code to count the number of pages, and it looked like this:
$ python
>>> from pyPdf import PdfFileReader
>>> years = range(1939,2012)
>>> for y in years:
...     print y, PdfFileReader(file("i1040--%s.pdf" % (y,), "rb")).getNumPages()
...
I copied the data from the output (I didn’t save it to a file, for speed’s sake), pasted it into Excel, and made this chart:
Fifteen years ago, tax instructions were half the size! More striking, the booklets from the ’80s have smaller pages, but were still significantly shorter than today’s.
So now I had a GIF, three charts, and a whole bunch of data. All that was left was words.
Read them all here: Line for line, US income taxes are more complex than ever
Development Time: 1 day
I was doing some video processing in Python today and couldn’t find any good answers on how to save right-side-up images using pyglet. Apparently pyglet saves them upside down by default. Here’s my solution:
import pyglet

#load the video
video = pyglet.media.load("myVideo.wmv")
#get the first frame
frame = video.get_next_video_frame()
#get the image data of the first frame
imageData = frame.get_image_data()
#get the pixels of the frame but invert the pitch
pixels = imageData.get_data(imageData.format, imageData.pitch * -1)
#set the inverted pixels back on the image, using the normal positive pitch,
#which leaves the rows flipped relative to how pyglet had them
imageData.set_data(imageData.format, imageData.pitch, pixels)
#save the image
imageData.save("frame.png")