Sense8 Season 1 Review

Just finished watching Sense8 Season 1 on Netflix.

Overall:

8 people are linked psychically and can transcend space to “visit” one another and “share” abilities. It’s the anti-superhero superhero show because each individual has humanly achievable skills. It tells each of their stories with increasing intersections, which at times works but at others leaves you scratching your head.

Execution:

It’s a difficult show to watch because the pace is so slow and, frankly, the payoff isn’t worth it. Most of the storylines are not that interesting, the acting is at times bland, and it certainly didn’t draw me in enough to care about any one character. It was also perplexing how there’s no real “holy crap wtf?” reaction as characters just pop into each other’s lives.

The irony:

There were quite a few plot holes and things that make no sense. I expect more from the Wachowskis and J. Michael Straczynski, so these were disappointing (and annoying).

Conclusion:

The only thing that made the show bearable was doing something else while half-paying attention to it. My wife and I were both on our devices; I was writing / planning my week and then playing with our dogs. If we missed something or didn’t understand it, we continued and didn’t rewind because we felt that, whatever it was, it wouldn’t actually be worth it.

There are plenty of critics’ reviews that panned the show, and they are not wrong. I think everyone expected a lot more from the Wachowski/Straczynski team.

If the second season doesn’t pick it up from the start then I’ll probably not continue.

If you can watch only 1 show, watch this instead:

If you’re looking for a science fiction show with excellent writing, high-concept, and strong execution, look no further than Orphan Black. IMHO it’s the best show on right now.

Rants:

I was imagining the creators brainstorming this show:

“Let’s do a show like Leverage where we’ve got people who fit a stereotype: a hacker, an actor, a tough guy, a thief…”

“But no Robin Hood act like Leverage, so let’s do a Paranormal Smurfs but make it extraordinarily slow and boring!”

“Trans Smurf who knows how to hack!”

“Tough Smurf!”

“Oh we need to be racist – let’s throw in some stereotypes!”

“Doctor Smurf! Let’s make her Indian!”

“Kung Fu Smurf! Let’s make her Asian!”

“Driver Smurf! Let’s make him Black!”

“Good Guy Smurf! Let’s make him a White Guy!”

“Bad Guy Smurf! No wait! Bad Guy Corporation headed by Evil White Guy Smurf!”

“We need a con artist like Face Man from the A Team. Let’s make him Spanish and Gay!”

“Let’s also add a completely useless character who has no real skills other than dropping the needle on the record and build a major plot line about her!”


How to auto tweet and create your own Twitter bot

Note: I implemented this myself today (May 3, 2015) and it works fine. Sources referenced for this post are

Matt Hopkins: Business, Marketing, Technology

SEO Smarty

Digital Inspiration, tech a la carte

 

Please don’t be intimidated by the number of steps involved — it’s not too bad, plus I explain everything in detail.

 

Step 1: Find the Source

We are going to use Twitter itself as the source of the content to retweet.

You will need a Twitter.com account. It is probably safer to set up a new one.

a. Log in, go to Settings -> Widgets (link) and create a new widget (down at the bottom of the image below). You can create widgets for user timelines, favorites, Twitter lists, collections and search results.

twitter-widget1

Then hit Create Widget on the right (see image below).

twitter-widget2

 

Then select if you want to use the straight feed of another user or search terms:

twitter-widget3

I entered the following search terms into the search tab:

 

“blackfish” OR “#blackfish” -source:twitterfeed -filter:retweets lang:en

This asks Twitter for the most recent tweets that include “blackfish” or the hashtag “#blackfish” but ignores tweets published using twitterfeed and ignores retweets. This leaves only original, non-bot tweets in English.

Here’s a screenshot of it in the search tab.

 

twitter-widget4

Hit create and in the URL of your browser, you’ll see an ID for the widget:

twitter_widget5

You can also grab it by going to the main widget page and editing the newly created widget, here’s mine:

https://twitter.com/settings/widgets/595021654989889536/edit

b. Next, click here to make a copy of the Google Script and choose Run -> Twitter_RSS to authorize the script. You only have to do this once.

TwitterRss

c. Then, go to Publish -> Deploy as Web App and click the “Save New Version” button. Set Anyone, including Anonymous under Who has access to the app and hit Deploy.
twitter-widget5

Once you hit deploy, you’ll get a window like this:

twitter-widget6

The next step is to append the Widget ID to the “Current web app URL” we just got from the window above:

 

https://script.google.com/macros/s/AKfycbwksl6hHZWT3PYOrd3qVXNuLwJWHP23f-pvuCxTyAy6hFE4KPc/exec

We do that by adding a “?” and the Widget ID, so you get this:

https://script.google.com/macros/s/AKfycbwksl6hHZWT3PYOrd3qVXNuLwJWHP23f-pvuCxTyAy6hFE4KPc/exec?595021654989889536
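As a small sketch, the same URL can be assembled in Python. The helper name build_feed_url is hypothetical and is not part of the Google Script; it only encodes the rule described above: web app URL, then “?”, then the Widget ID.

```python
def build_feed_url(web_app_url, widget_id):
    """Append the widget ID to the web app URL as a bare query string."""
    # Hypothetical helper: plain string assembly, nothing Google-specific.
    return "{0}?{1}".format(web_app_url.rstrip("?"), widget_id)


web_app_url = (
    "https://script.google.com/macros/s/"
    "AKfycbwksl6hHZWT3PYOrd3qVXNuLwJWHP23f-pvuCxTyAy6hFE4KPc/exec"
)
feed_url = build_feed_url(web_app_url, "595021654989889536")
print(feed_url)
```

Swap in your own web app URL and Widget ID; the values above are the ones from this post.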

 

Step 2: Feedburner

Go to Feedburner and paste in the URL string above that includes your Widget ID:

twitter-widget7

Hit next, rename the feed title (if you wish) and hit next again.

twitter-widget8

If you did everything correctly, you’ll see:

twitter-widget9

 

Step 3: Yahoo Pipes

Head over to Yahoo Pipes (you may need to register if you don’t already have a yahoo.com account — it’s worth it!).

Yahoo Pipes gives you a lot of filtering control over your feed. You’ll see soon why it’s so cool.

a. Under Sources grab Fetch Feed  and drop it into your workspace. Paste the feedburner url into the box.

twitter-widget10

b. Under Operators (on the left), grab and drop a Filter onto the workspace.

twitter-widget11

This is where it gets powerful. You can set the filter to block items that match all of the rules (i.e., an “and”) or any rule on the list (i.e., an “or”). Here we set the rule to item.title Contains RT, which means we’re blocking retweets.

c. Drag and drop a Loop module from the Operators list.

twitter-widget12

Then, from the String list, drag and drop a String Builder into the Loop module. Hit the + String a few times so you have 3 slots like this:

twitter-widget13

 

In the first box, select “item.author” but then edit it to read “item.author.uri”. In the 2nd box, put in a space, e.g., ” ” (click into the box and hit the space bar). In the 3rd box select “item.title” from the dropdown.

 

Finally, in the assign results box at the bottom, select “item.title.” It should look like this:

twitter-widget25

d. Go back to the Operators list and grab a Regex module and drop it into your workspace. We’re going to put “item.title” in the first box, “http://twitter.com” in the middle (replace) box, and “RT @” in the last (with) box. Note that there is a space between RT and the @. It should look like this:

twitter-widget15

 

e. Let’s wire everything up! We’re going to go from Fetch Feed into Filter into Loop into Regex and then into Pipe Output, like this:

twitter-widget24
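If it helps to see the wiring as code, here is a rough Python sketch of what the pipe does to each feed item. This is not how Yahoo Pipes works internally. The function name run_pipe, the dictionary keys, and the sample item are all made up for illustration, and the case-sensitive “contains” check is an assumption.

```python
import re


def run_pipe(items, pattern="http://twitter.com", replacement="RT @"):
    """Mimic Filter -> Loop/String Builder -> Regex on a list of dicts."""
    output = []
    for item in items:
        # Filter: block any item whose title contains "RT" (a retweet).
        if "RT" in item["title"]:
            continue
        # Loop + String Builder: author URI, a space, then the original title.
        title = item["author_uri"] + " " + item["title"]
        # Regex: swap the Twitter host for "RT @" (pattern treated literally).
        title = re.sub(re.escape(pattern), replacement, title)
        output.append({"title": title})
    return output


# Hypothetical feed items, only to show the shape of the data.
sample = [
    {"author_uri": "http://twitter.com/orcafan", "title": "Watching blackfish tonight"},
    {"author_uri": "http://twitter.com/someone", "title": "RT @orcafan Watching blackfish tonight"},
]
for entry in run_pipe(sample):
    print(entry["title"])
```

The pattern is a parameter because the exact form of item.author.uri depends on the feed; whether it needs a trailing slash is something to check against your own pipe output.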

 

f. Up on top click on Save then name your pipe. Once saved, click on Run Pipe up on top as well.

twitter-widget17

When you click Run Pipe you’ll see something like the image below. Then click on the Get as RSS (in the middle top) and copy the URL.

twitter-widget18

In this case, I wind up with

http://pipes.yahoo.com/pipes/pipe.run?_id=b2d3bed719c10b166547164636bf5303&_render=rss

Step 4: TwitterFeed

Go to TwitterFeed (sign up for a free account if you don’t have one).

Enter in the Feed Name, and paste in the Pipes URL we just generated.

twitter-widget19

You can do some Advanced Options: I’ll set Post Content to include title & description and set it to 3 tweets.

twitter-widget23

Hit Continue to Step 2 and select Twitter as the recipient of the feed. You will have to authenticate:

twitter-widget20

Once you hit Authorize App …

twitter-widget21

It will take you back to TwitterFeed. The last step is to hit Create Service and your twitterbot is live!

twitter-widget22

Finally, you can test the output from the TwitterFeed dashboard by hitting Check Now!

twitter-widget26

Have fun!

 

Note: modifying your settings can be a bit of a pain. Sometimes you have to go back to the Pipes, Save as new, run, get the RSS of the new pipe, and overwrite what you’ve done on TwitterFeed. Play around with it. It takes a little tweaking but once set, it’s great!


iPhone iOS8 vs. Android 5.0 Lollipop

This post is in response to CNN Money’s comparison of the iPhone vs. Android. In their article they compare iOS8 to Android 4.x Kit Kat and mark a blue circle with “win” or “tie” on top of their selection.

I add a red “win”, “tie”, “meh” or “hmm” on top of the CNN image indicating the winner or otherwise. Below the modified CNN image, I’ll add screenshots from Android 5.0 Lollipop from my Nexus 5 to back up my claims as needed.

 

ios8-vs-android5_lollipop_myscore

LOGIN

ios8-vs-android5_cnn_lockscreen

My Verdict: Meh

For me this is a personal preference: I don’t want Apple or any other corporation to have my fingerprint.

We all know that nothing is un-hackable – so for me it’s a personal choice that I don’t want to store that on my phone which could be lost, stolen, hacked, or co-opted by a corporation.

 

MAKE A CALL

ios8-vs-android5_cnn_phone

My Verdict: iPhone

The Lollipop phone app (below) improves on the Kit Kat experience, but iOS is still cleaner.

ios8-vs-android5_lollipop_phone

 

 

CHECK THE TIME

ios8-vs-android5_cnn_time

My Verdict: Meh

Who cares?

They both tell you the time.

One of them might save a nanosecond, but not really worth comparing or caring.

 

TAKE A PHOTO OR VIDEO

ios8-vs-android5_cnn_camera

My Verdict: Meh

For people into photography this might matter most.

For someone like me, pixels and picture quality have not been an issue since the iPhone 3.

With all the increases in megapixels come increases in file size: more device storage, more hard drive storage at home or cloud storage, and more of a pain to email (either you shrink them or consume more bandwidth in transmission).

My dog Dana is beautiful, but I don’t need to see into the pores on her nose.

 

TYPE

ios8-vs-android5_cnn_keyboard

My Verdict: Android

I love love love the swipe typing on the Google stock keyboard. In fact, the stock keyboard is so good that I ditched all the 3rd party keyboard apps I had from my earlier foray into Android land (Motorola Atrix years ago, just terrible).

At this point I can’t two finger type on a phone because I’m so used to swipe typing.

On Lollipop the experience is even better than it was on Kit Kat: the design is super clean, the keyboard is very usable and accurate.

Oh on my iPad 3 running iOS 8, I got the Swype keyboard. It was clunky, not smooth, and a painful experience. This may be because the iPad 3 is older hardware, but still, doesn’t compare to the fluid swipe experience on Android devices.

ios8-vs-android5_lollipop_email3

 

USE APPS

ios8-vs-android5_cnn_desktop

My Verdict: Android

@CNNMoney that’s just a painfully busy background image you selected.

Here, have a look at my Lollipop desktop  (running the Solo Launcher) and compare it with iOS in the image above.

ios8-vs-android5_lollipop_home2

I like to keep my desktop clean and use only 1 page. That’s my girl Dana as the wallpaper.

The bottom row of icons is the “dock” (similar to on a Mac). I have 2 pages of my dock, which means I can swipe that single row right / left and another 5 icons appear (I’m only using 4 slots, but can add one more to page 2 of the dock).

Top row: Google Maps, Chrome, Calendar, Photo Album, Camera

Bottom Row (my dock page 1): Phone, Contacts, App Drawer, Google Messenger (SMS),  Mail

Bottom Row (my dock page 2, not pictured): Timely (clocks, timer, alarm), Skype, Tapatalk, Evernote

All of the other apps are tucked away in the app drawer. Clean, beautiful, organized.

Not to harp on a played out point: Android’s ability to customize the desktop via 3rd party apps called “Launchers” allows you to pick your own experience. Many of these give you lots of control on look and feel, gestures, and much more. Unfortunately Apple has stayed the course with the same user experience as their first iPhone, which is in part why I got sick of iOS. There are a lot of other reasons. But mostly it’s because they have lacked innovation for years (they would have you believe otherwise).

As for App Stores – I find Google Play to be easier to navigate, and better organized.

 

NOTIFICATIONS

ios8-vs-android5_cnn_notifications

My Verdict: Android

Android’s been doing notifications better than iOS for a long time – that whole window shade pull down thing was on Android first.

That said, elements of iOS notifications were better – namely, the pop ups for like incoming SMS etc. On Kit Kat you’d need 3rd party apps to help with pop up type notifications. No more – with Lollipop they close the gap here.

The image below shows notifications on Lollipop with no 3rd party apps installed. If I double tap an item, it opens that app directly. It’s pretty smooth.

ios8-vs-android5_lollipop_notification2

 

 

MUSIC / PODCASTS

ios8-vs-android5_cnn_music

My Verdict: Android

“Apple has iTunes. Enough said.” — ??

@CnnMoney have you considered that some people might not like iTunes???

I have been producing EDM for 20+ years and iTunes made me stop listening to music. It’s so tedious, especially if you have a lot of music. Back before smartphones absorbed music players, you’d have an MP3 player that was basically a hard drive. You’d just attach it to a computer, drag and drop copy whatever you wanted, and you’re good. iTunes killed that ease of use and my desire to listen to music on my iPhone.

Android works more like the older MP3 players. Nothing to sync – just drag and drop. I like this much better. iTunes is an unfortunate part of Mac-life…

Yeah, so let me emphasize: Android wins because Apple handcuffs you to iTunes. Enough said.

 

EMAIL

ios8-vs-android5_cnn_email

My Verdict: Android

In all fairness, CNN labeled this as “Check Email” and I’m making this about email generally.

On Lollipop, email looks like this:

ios8-vs-android5_lollipop_email1

It’s a beautiful, clean UI. Whether you prefer this over iOS or vice versa – that’s up to you.

Where Android really wins is writing email. Why? Because a) the keyboard crushes the iOS keyboard and b) you can attach files!  Imagine being able to attach a file in 2014!

 

ios8-vs-android5_lollipop_email4

 

GET DIRECTIONS

ios8-vs-android5_cnn_maps

My Verdict: Android

Google Phone running Google Maps. Enough said.

 

CONTACTS

ios8-vs-android5_cnn_contacts

My Verdict: Tie

Though Lollipop finally renames People (which was annoying) to Contacts, and has improved the UI significantly.

ios8-vs-android5_lollipop_contacts

 

SEARCH

ios8-vs-android5_cnn_search

My Verdict: Tie

They both get the job done.

 

SET TO VIBRATE

ios8-vs-android5_cnn_vibrate

My Verdict: iPhone

Yes, the physical button for something this important is much easier.

 

TALK TO PHONE

ios8-vs-android5_cnn_voice_score

My Verdict: Android

OK Google works very accurately for me. Given I’m of South Indian descent, my family and many friends don’t have “American Names”. Siri always struggled with this. OK Google is near flawless for me all the time, and I find it very useful.

In Lollipop this is over the top because now you can OK Google from anywhere – lock screen, any app, etc. Frankly, on Kit Kat I really hated having to go back to the home screen. That’s where iOS’s button press to Siri rocked (which could be done by headset button press). With Google I don’t even have to press a button on my headset. OK Google from anywhere and it works.

ios8-vs-android5_lollipop_voice

 

 

USE FLASHLIGHT

ios8-vs-android5_cnn_flashlight

My Verdict: Hmm

Yep, iOS8 has some nifty capabilities in their control area, available from any screen.

In Lollipop they add similar controls – so I guess this makes it a tie. The only disappointment, the “hmm”, is it would be much cooler if you could modify what controls are in here. Nevertheless,  if we’re talking straight Flashlight – to – Flashlight comparison, well then it’s a tie.

By the way, changing the brightness on Android is a treat. When you start to slide, the entire notification panel fades away except for the brightness slider, and you can preview on your home screen itself what the brightness amount looks like. It’s really coooool!

ios8-vs-android5_lollipop_control

FINAL SCORE

winner: Android 5.0 Lollipop: 7

iOS8: 2

Tie: 2

Meh: 3

Hmm: 1

 

BATTERY LIFE

The one area where iPhone previously crushed all Android Phones was battery life.

No more.

Project Volta on my Nexus lives up to its promise.

Because I travel a lot for work, battery life is critical for me. I was charging my Nexus constantly (5-6 times per day) and now I only charge once in the late late afternoon. Yes, I’m a heavy user.

Sadly, I was considering switching to an iPhone 6 because of the battery life on my Nexus. Now, thank goodness, I don’t have to.

 

LOLLIPOP IMPROVES STOCK APPS

On Kit Kat the stock apps for calendar, SMS (which was Hangouts), phone, and people were not very attractive and had usability issues. As a result, I installed 3rd party apps like DigiCal and EvolveSMS.

On Lollipop, Google upgraded the UI / UX of their stock apps for phone, calendar, contacts (formerly called “People”), and introduced Google Messenger as a replacement for Hangouts. These are all fantastically designed and offer a great experience.

After upgrading, I got rid of a lot of 3rd party apps that are no longer needed.

Finally – whatever gap there might have been between Android and iOS is now closed – and the big winners are Android users. It will be interesting to see how Apple responds.

 

FYI I AM AN APPLE FANBOY

Given the above, it might surprise you to know that I’m an Apple Fanboy. I love OS X, have had Mac computers for 10+ years (previously Windows), had iPads 1-3 and iPhones 2-5.

I had a mobile dev company for a while and switched to Android years ago when the Motorola Atrix was the top of the line… And I hated it, within 6 months I was back to iPhone.

About 4 months ago my friend convinced me to give the Nexus 5 a shot. I was complaining about Apple’s lack of innovation (yes, I said it) and iTunes. I also hated that iOS essentially is the same UI / UX as it was when the iPhone 2 was out.  So I got the Nexus and was generally happy. Kit Kat was not as refined, and yes the battery life was not good.

Lollipop is very refined, aesthetically pleasing, and closes any gaps that iOS might have had over Android. It feels like I have a new phone!


UPDATED: How To Use Python Fuzzy Matching Lists to Clean Large MP3 Library

I’ve been working on cleaning up my mp3 library and wrote a post sharing a script that I created to help me.

Quickly I found this wasn’t a great approach since it was destructive – it would immediately take action on the files (overwriting or deleting matches).

Next, I decided to do the analysis of the files and output to text files that would include filenames and what actions to take. The problem now was the text file became huge, especially because I kept repeating the same filenames in source and targets, etc.

So I decided to use MySQL (locally on my Mac), Python and SQLAlchemy (for efficiency and, frankly, practice).

The idea was to create a master directory of all my music files, storing them in the table ‘directory’.  This table stores the filename and size.

Then I’d have a separate ‘words’ table that breaks out the words in each filename into a list, makes each word lowercase, removes repeat words, and strips out the file extension (.mp3).

In the master directory I have this file name:

Alex Gaudino (feat. Crystal Waters) (Static Shokx remix) – Destination Calabria (Static Shokx Remix).mp3

Which gets put into the words table as this:

shokx,remix,calabria,alex,waters,destination,crystal,static,feat,gaudino

The words table has columns for id, source (which is the foreign key to the directory table id), and the wordlist (as written above).

Next I wanted to do an analysis comparing each of the wordlists with each other using the fuzzywuzzy module to score the word lists for similarity. The trick here is I’d replace commas with spaces, so the above list becomes:

shokx remix calabria alex waters destination crystal static feat gaudino

and use the token_sort_ratio, because that can compare strings with words in different orders.
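To show why sorting the words first makes word order irrelevant, here is a standard-library approximation of the idea. This is not fuzzywuzzy’s actual implementation, and the name approx_token_sort_ratio is made up; it only sorts the words in each string and scores the results with difflib.

```python
from difflib import SequenceMatcher


def approx_token_sort_ratio(a, b):
    """Sort each string's words, then score the sorted strings from 0 to 100."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return int(round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio()))


# Same words as the example above, in two different orders.
w1 = "shokx remix calabria alex waters destination crystal static feat gaudino"
w2 = "alex gaudino feat crystal waters destination calabria static shokx remix"
print(approx_token_sort_ratio(w1, w2))  # prints 100
```

Because both strings sort to the same sequence of words, the score is 100 even though the original word order differs.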

If the returned fuzzy score is 95 or higher (where 100 is a 100% match), then it compares the two files’ sizes. If the difference is 1.5MB or more, then I don’t want it to overwrite the target; instead I want to keep both. If the difference is under 1.5MB, then I want to keep the larger file.

I added a ‘ratios’ table with the columns: table id, source id (integer, foreign key into master ‘directory’ table id), target id (integer, foreign key into master ‘directory’ table), ratio score, and file size difference.

Finally, the last table is the ‘act’ table, which stores the action to take via the columns: table id, source id (foreign key to ‘directory’ id), target id (foreign key to ‘directory’ id), todo (what action to take on the source and target files).

One todo option is ‘a2b’, which means copy source (a) to target (b), i.e. rename the target file with the source file name. This effectively deletes the target file. The other todo option is ‘b2a’ which means copy the target (b) to source (a), i.e., rename the source file with the target file name. This effectively deletes the source file. Again, the point is to keep the larger file.
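The decision rule can be sketched as one small function. The name decide_action and the two constants are hypothetical; the thresholds (95 and 1,500,000 bytes) are the ones used in this post.

```python
FUZZY_THRESHOLD = 95      # minimum score to treat two names as the same song
MAX_SIZE_DIFF = 1500000   # 1.5MB, in bytes


def decide_action(ratio, size_a, size_b):
    """Return 'a2b', 'b2a', 'skip', or None when the names do not match."""
    if ratio < FUZZY_THRESHOLD:
        return None
    if abs(size_a - size_b) >= MAX_SIZE_DIFF:
        # Sizes differ too much to be the same recording: keep both files.
        return "skip"
    # Keep the larger file: a2b when the source (a) is at least as large.
    return "a2b" if size_a >= size_b else "b2a"


print(decide_action(100, 5000000, 4800000))
print(decide_action(100, 9000000, 4000000))
```

The first call returns a2b because the source is the larger file and the sizes are within 1.5MB; the second returns skip because the sizes are too far apart.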

If the above criteria are met, I want to make sure I have not already stored the info in the database, i.e., no duplicate entries.

The tricky part is this

source id = 45, target id=100, and action = ‘a2b’

is the same as

source id = 100, target id = 45, action = ‘b2a’

These are effectively inverses of one another. I only want to store one of these to conserve database space.
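One possible way to catch these inverse pairs is to reduce every entry to a single canonical form before storing it. This is only a sketch and not what the Step 3 script does (the script keeps a list of what is already stored and adds both forms to it); the names canonical and INVERSE are made up.

```python
INVERSE = {"a2b": "b2a", "b2a": "a2b"}


def canonical(source, target, action):
    """Put the smaller id first, flipping the action if the ids were swapped."""
    if source <= target:
        return (source, target, action)
    return (target, source, INVERSE.get(action, action))


seen = set()
for entry in [(45, 100, "a2b"), (100, 45, "b2a"), (7, 9, "b2a")]:
    seen.add(canonical(*entry))

print(sorted(seen))  # prints [(7, 9, 'b2a'), (45, 100, 'a2b')]
```

The first two entries collapse into one, so only two rows would be stored.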

[See Step 3 for the full comments, explaining the table and model setup]

Step 1: output a text file with the filename and sizes of all of my MP3 files.

#!/usr/bin/env python

import os

# # path to mp3 folder
path = '/Volumes/1TB_Baby/iTunes'

def get_all_files(path):
    # # walks the folder tree and returns the full path of every
    # # sub-folder and file
    file_list = []
    for dirname, dirnames, filenames in os.walk(path):
        for subdirname in dirnames:
            file_list.append(os.path.join(dirname, subdirname))
        for filename in filenames:
            # # skip python scripts and .DS_Store files
            if '.py' not in filename and 'Store' not in filename:
                file_list.append(os.path.join(dirname, filename))
    return file_list

out = open('/Users/InNov8/Desktop/^all_files_map-FULL.txt', 'w')
index = 0

# # writes a header to the text file (Step 2 skips this first line)
header = 'index|size|name\n'
out.write(header)

for src in get_all_files(path):
    # # the last part of the path is the file name
    name = os.path.basename(src)

    # # gets the file size in bytes
    size = os.path.getsize(src)

    # # write out a pipe-delimited line
    out.write(str(index + 1) + '|' + str(size) + '|' + name.rstrip() + '\n')
    index += 1

out.close()

 

Step 2: import the file list into the database. This script includes connecting to the database and defines the tables (models)

#!/usr/bin/python

import sys

import pymysql
import sqlalchemy
from sqlalchemy import MetaData, Table, Column, Integer, String, ForeignKey

engine = sqlalchemy.create_engine('mysql+pymysql://root@127.0.0.1/music?charset=utf8&use_unicode=0', pool_recycle=3600)
connection = engine.connect()

metadata = MetaData()

# define the tables / models in the database
# i add the ratios and act tables in Step 3 of this post
# MySQL needs a length for VARCHAR columns; 255 is an assumed value
directory = Table('directory', metadata,
    Column('id', Integer, primary_key=True),
    Column('size', Integer),
    Column('name', String(255), unique=True),
    )

words = Table('words', metadata,
    Column('id', Integer, primary_key=True),
    Column('source', Integer, ForeignKey('directory.id')),
    Column('wordlist', String(255)),
    )

metadata.create_all(engine)

# add content to database
path = sys.path[0]
f = open(path + '/^all_files_map-FULL.txt', 'r')
data = f.readlines()
f.close()

error_counter = 0
success_counter = 0
error = []

# start at 1 to skip the header line
for x in range(1, len(data)):
    # split on the first two pipes only, in case a filename contains one
    chunks = data[x].split('|', 2)
    index = chunks[0]
    size = chunks[1]
    name = chunks[2].rstrip()

    try:
        ins = directory.insert().values(size=int(size), name=name)
        connection.execute(ins)
        success_counter += 1
    except Exception as e:
        print(index + ' - ERROR')
        error_counter += 1
        # keep the index, filename, and error message for the log
        error.append(index + '|' + name + '|' + str(e))

# write the failed rows to a log file
out = open(path + '/importerror.txt', 'w')
for item in error:
    out.write(item + '\n')
out.close()

print("SUCCESS: " + str(success_counter))
print("ERROR: " + str(error_counter))

Step 3: This script adds data to the words, ratios, and act tables. While these are all in one script, I would first uncomment the code that adds the words and comment out the execution part for the ratios and act tables.

#!/usr/bin/python

import pymysql
import sqlalchemy
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import *
from sqlalchemy.orm import sessionmaker
from sqlalchemy.sql import select
from difflib import SequenceMatcher as SM
import difflib
from sqlalchemy.sql import exists
from fuzzywuzzy import fuzz

# connect to database
engine = sqlalchemy.create_engine('mysql+pymysql://root@127.0.0.1/music?charset=utf8&use_unicode=0', pool_recycle=3600)
connection = engine.connect()

Base = declarative_base(bind=engine)
Base

metadata = MetaData()

# the database tables / models, adds the act and ratios tables
directory = Table('directory', metadata,
    Column('id', Integer, primary_key=True),
    Column('size', Integer),
    Column('name', String, unique=True),
    )

words = Table('words', metadata,
    Column('id', Integer, primary_key=True),
    Column('source', Integer, ForeignKey('directory.id')),
    Column('wordlist', String),
    )

ratios = Table('ratios', metadata,
    Column('id', Integer, primary_key=True),
    Column('source', Integer, ForeignKey('directory.id')),
    Column('target', Integer, ForeignKey('directory.id')),
    Column('ratio', Float),
    Column('difference', Integer)
    )

act = Table('act', metadata,
    Column('id', Integer, primary_key=True),
    Column('source', Integer, ForeignKey('directory.id')),
    Column('target', Integer, ForeignKey('directory.id')),
    Column('todo', String(20)),
    )

metadata.create_all(engine)

# ---------------------------------------------------------------------
# Functions
# ---------------------------------------------------------------------

# cleans the filename punctuation, removes duplicate words in the name
# returns the unique words as a set
def createWordList(line):
    wordList2 = []
    wordList1 = line.split()
    for word in wordList1:
        cleanWord = ""
        for char in word:
            if char in '!,.?":;[]()_-#@^&{}':
                char = ""
            cleanWord += char
        # skip words that were nothing but punctuation
        if len(cleanWord) > 0:
            wordList2.append(cleanWord.lower())

    # a set keeps only the unique words
    finalWordList = set(wordList2)
    return finalWordList

# ---------------------------------------------------------------------
# ADD DATA TO WORDS TABLE
# store the filename unique words into the words table
# ---------------------------------------------------------------------

# # UNCOMMENT BELOW IF YOU WANT TO CONVERT THE FILES IN THE DIRECTORY
# # TABLE INTO WORDS AND STORE THEM IN THE WORDS TABLE
# src = directory.select()
# src_result = connection.execute(src)
# src_id = 0
# src_name = ''
# src_size = ''

# for row in src_result:
#     src_id = row[0]
#     src_size = row[1]
#     src_name = row[2].replace('.mp3','')

#     wordsplit = createWordList(src_name)

#     out_list = []
#     for w in wordsplit:
#         out_list.append(w)
#     out = ','.join(out_list)
#     print out

#     i = words.insert().values(source=src_id, wordlist=out)
#     rsi = connection.execute(i)

# ---------------------------------------------------------------------
# ADD DATA TO RATIOS AND ACT TABLES
# This compares the filename words for fuzzy match
# checks the size difference between the matching files
# if difference >= 1.5MB then skip
# else write to database
# ---------------------------------------------------------------------

# # create a dictionary to store the song data
# # use the dictionary data structure to ensure uniques
song = {}

# # select rows from the words table
# # limit is number of records, offset is how many initially to skip
# # change the limit and offset values
src = words.select().limit(1000).offset(0)
result = connection.execute(src)

for r in result:
    source = r[1] 
    # # named wordlist so it does not shadow the words table
    wordlist = r[2]

    # # collect additional information from the directory table
    s = directory.select(directory.c.id==source)
    rst = connection.execute(s)

    for x in rst:
        size = x[1]
        name = x[2]

    # store the source id from words table (foreign key into directory table), words, size, and filename
    song[source] = [ wordlist, size, name]

# # create list to store data to be written to the database
write = []

# # iterate the song dictionary, select one at a time, source song
for k,v in song.items():

    # # iterate through the song dictionary, this will compare the initial
    # # selection k (key) with each and every other song, target song
    for a,b in song.items():

        # # if the source song is not the same as the target song
        # # we don't want to compare a song to itself!
        if k != a:
            # # remove the commas from the word lists
            w1 = song[k][0].replace(',',' ')
            w2 = song[a][0].replace(',',' ')

            # # fuzzy match calculate ratios between two filename word list
            ratio = fuzz.token_sort_ratio(w1, w2)

            print ratio, ' | ', w1, ' | ', w2

            # # change the value below to change the fuzzy ratio threshold
            # # the further away from 100 (which is 100% match),
            # # the 'fuzzier' it will be, i.e., less precise
            if ratio >= 95:
                print '\tRATIO: ' + str(ratio)

                # # calculate absolute value of the size difference
                d = abs(song[k][1] - song[a][1])

                # # change the value below if you want it to be different
                # # from 1.5Mb
                if d >= 1500000:
                    print '\tSize Diff: Too Large [', d, ']'
                    action = 'skip'
                    print '\tACTION: ', action
                else:
                    print '\tSize Diff: ', d
                    if song[k][1] >= song[a][1]:
                        larger = k
                        sng = 'SONG A' 
                        action = 'a2b'
                    else:
                        larger = a
                        sng = 'SONG B'
                        action = 'b2a'
                    # # sng and larger only exist when the pair is not skipped
                    print '\tLarger: ', sng, ' - id:', larger
                    print '\tACTION: ', action, '\n\n'

                # # if the entry is not already in the write list, store it
                # # ('skip' entries are filtered out before the database write)
                if [k, a, ratio, d, action] not in write and [a, k, ratio, d, action] not in write:
                    write.append([k,a,ratio,d,action])


# # store everything in the database into a list
stored = []

# # query the act table and get all the results
s = act.select()
rs = connection.execute(s)

# # the results come back in this form
# # (94, 1236, 1237, 'b2a')
# # index[0] = table id
# # index[1] = source filename id relates to directory table
# # index[2] = target filename id relates to directory table
# # index[3] = the action, a2b = rename b with a name OR 
# #                        b2a = rename a with b
# # pop the index[0], remove the id to get the raw source, target, action
# # append to stored list
for r in rs:
    c = list(r)
    c.pop(0)
    stored.append(c)

# # this prints out what is stored, for reference, not necessary
print '--------'
print stored
print '^^^^^^^^'


# # iterate through the to-write list
# # syntax is [1746, 1732, 100, 440043, 'a2b']
# # index[0] = source id refers to directory table
# # index[1] = target id refers to directory table
# # index[2] = fuzzy ratio
# # index[3] = file size difference
# # index[4] = action

# # loop over a copy, because items are removed from write inside the loop
for i in write[:]:
        # # print out the info to write
        print i
        # # make a copy of the item to a new list
        w = i[:]

        # # pop the ratio and the file size difference (i.e., remove these from list)
        # # this makes the syntax the same as the items in stored list
        # # which allows you to compare the to write with what's in the stored list
        w.pop(2)
        w.pop(2)

        # if the item is in stored, then remove it from the to write list
        if w in stored:
            write.remove(i)

        # # if the action is deemed as skip then skip to the next loop
        # # take no action with this item
        else:
            if i[-1] == 'skip':
                continue

            # # if everything is fine then we write the item to the database
            # # we write to the ratio table which is the source id, target id,
            # #    ratio, and file size difference
            # # we write to the action table which is the source id, target id, action
            # # append the item to the stored list
            # # append the inverse to the stored list
            # # e.g., if source id = 45, target id = 100, and action = 'a2b'
            # # this is the same as source id = 100, target id = 45, action = 'b2a'
            # # we do not want to create duplicate / inverse entries
            # # by appending both to stored list, it will skip every occurrence and its
            # # inverse, thus not duplicating records
            try:
                r = ratios.insert().values(source=i[0], target=i[1], ratio=i[2], difference=i[3])
                rs = connection.execute(r)

                a = act.insert().values(source=i[0], target=i[1], todo=i[4])
                ra = connection.execute(a)
                stored.append([i[0],i[1],i[4]])
                stored.append([i[1],i[0],i[4][::-1]])
                print 'INSERTED'
            except Exception, e:
                r = ratios.update().where(ratios.c.source==i[0], ratios.c.target==i[1]).values(source=i[0], target=i[1], ratio=i[2], difference=i[3])
                rs = connection.execute(r)

                a = act.update().where(act.c.source==i[0], act.c.target==i[1]).values(source=i[0], target=i[1], todo=i[4])
                ra = connection.execute(a)
                print 'UPDATED'


There you have it.
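
The duplicate / inverse check described in the comments above can be sketched on its own. This is a minimal illustration written for Python 3, not the script itself: the `inverse` helper and the sample ids are made up for the example, and it only covers the 'a2b' / 'b2a' actions (a 'skip' action never reaches this step).

```python
def inverse(entry):
    """Return the mirrored entry: swap source and target, flip 'a2b' <-> 'b2a'."""
    source_id, target_id, action = entry
    return [target_id, source_id, action[::-1]]


stored = []      # what is already in the database (empty here)
inserted = []    # what this run would actually write

candidates = [
    [45, 100, 'a2b'],
    [100, 45, 'b2a'],   # inverse of the first entry, should be skipped
    [45, 100, 'a2b'],   # exact duplicate, should be skipped
    [7, 9, 'b2a'],
]

for entry in candidates:
    if entry in stored:
        continue
    inserted.append(entry)
    # remember both the entry and its inverse so neither is written again
    stored.append(entry)
    stored.append(inverse(entry))

print(inserted)
```

Only the first and last candidates end up in `inserted`.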

The next thing I will do is write a script to grab info from the act table (for source, target, and what to do) and then grab the actual filenames from the directory table, and execute.

The directory table is the only one that contains the filenames (unique, occurring only once), and all other tables access the filenames via foreign keys. Searching for integers is a lot faster than searching for strings. This reduces how much info I need to store in the database and makes things pretty fast.
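
To make the foreign key idea concrete, here is a small sketch of how the act table can be joined back to the directory table to recover the filenames. It uses the standard library's sqlite3 module with an in-memory database instead of SQLAlchemy so it runs on its own (Python 3). The column names follow the ones used in this post where they are shown; the exact schema and the sample filenames are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: directory holds each filename once,
# act refers to filenames only by their integer ids.
conn.executescript("""
CREATE TABLE directory (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE act (
    id     INTEGER PRIMARY KEY,
    source INTEGER NOT NULL REFERENCES directory(id),
    target INTEGER NOT NULL REFERENCES directory(id),
    todo   TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO directory (id, name) VALUES (?, ?)",
    [(1236, "song one.mp3"), (1237, "song 1.mp3")],
)
conn.execute(
    "INSERT INTO act (source, target, todo) VALUES (?, ?, ?)",
    (1236, 1237, "b2a"),
)

# Join act to directory twice: once for the source id, once for the target id.
rows = conn.execute("""
    SELECT s.name, t.name, act.todo
    FROM act
    JOIN directory AS s ON s.id = act.source
    JOIN directory AS t ON t.id = act.target
""").fetchall()

for source_name, target_name, todo in rows:
    print(source_name, target_name, todo)
```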

Let me know if you have any questions!

 

Here is code for SQLAlchemy query syntax, just an fyi:

# ---------------------------------------------------------------------
# QUERY SYNTAX EXAMPLES - FOR REFERENCE
# These for reference re: various query syntax
# ---------------------------------------------------------------------

# # SELECT
# s = select([Directory]).where(Directory.name.like("%\smeta%"))

# s = directory.select(directory.c.id==1)
# rs = connection.execute(s)

# for r in rs:
#     print r

# # INSERT
# i = words.insert().values(source=1, wordlist='blah')
# rsi = connection.execute(i)

# # UPDATE
# a = act.update().where(act.c.id==1).values(source=1, target=2, todo='darn')
# rs = connection.execute(a)


How To Use Python Fuzzy Matching Lists to Clean Large MP3 Library

I have a large library of MP3s (> 100 GB). Many of these are the same song but are named differently, have different encodings, or differ in some other slight way that keeps a de-duplication utility from catching them.

For example:
Bit Lolitas – Nicki Trax (Prog House) – Murder Weapon (Original Mix).mp3 [24.1Mb]
Bit Lolitas – Nicki Trax Prog House – Murder Weapon (Original Mix).mp3 [24.2Mb]

What I’m about to tell you is dangerous: if you’re not careful, you can end up deleting a lot of files with no undo.

The algorithm

  1. Read all the music filenames in the folder (filter out the .py script and .DS_Store)
  2. Split each filename into a list of lowercase words, remove the spaces from the list
  3. Compare each filename word list with the word list of every other filename and score for similarity, return matches if the similarity is a certain ratio or higher. A ratio result of 1.0 means the filename word lists are the same
  4. Set two thresholds: if the ratio is greater than or equal to X, then delete the smaller file in the match; if the ratio is within a range Y – Z, then give me options for manual intervention
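
Steps 2 and 3 can be sketched in a few lines. This is a simplified illustration written for Python 3, not the script below: the `word_list` and `similarity` names are made up for the example, and the third filename is an invented "radio edit" variant.

```python
from difflib import SequenceMatcher

PUNCTUATION = '!,.?":;[]()_-#@^'


def word_list(filename):
    """Split a filename into lowercase words with punctuation stripped."""
    words = []
    for word in filename.lower().split():
        cleaned = "".join(ch for ch in word if ch not in PUNCTUATION)
        if cleaned:  # drop words that were only punctuation
            words.append(cleaned)
    return words


def similarity(name_a, name_b):
    """Ratio between 0.0 and 1.0 comparing the two word lists."""
    return SequenceMatcher(None, word_list(name_a), word_list(name_b)).ratio()


a = "Bit Lolitas - Nicki Trax (Prog House) - Murder Weapon (Original Mix)"
b = "Bit Lolitas - Nicki Trax Prog House - Murder Weapon (Original Mix)"
c = "Bit Lolitas - Nicki Trax (Prog House) - Murder Weapon (Radio Edit)"

print(similarity(a, b))  # identical word lists
print(similarity(a, c))  # last two words differ
```

The first pair has identical word lists, so the ratio is 1.0. The second pair shares 8 of 10 words in order, which gives 0.8 and would fall below a 0.95 threshold.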

One of the key risks is that the MP3s might be different versions, like an extended mix vs. a shorter radio edit. To handle this, I also compare the file sizes of files that come back with similar word lists. If the difference exceeds 1.5 MB, I skip the pair: no rename, no delete.

#!/usr/bin/env python

import os
import re
import sys
from difflib import SequenceMatcher as SM
import difflib
import time

'''
This was written to clean a large library of MP3 files. The files were run through 
a de-duplication utility, however it did not catch them all. 
'''

# set the path to the folder of the files
path = '/Volumes/1TB_Baby/iTunes'

The get_all_files function stores all of the filenames in the folder in a list. It skips any filename containing .py or Store (which covers .DS_Store), and it also adds the subdirectory paths to the list:

def get_all_files(path):
	file_list = []
	for dirname, dirnames, filenames in os.walk(path):
	    for subdirname in dirnames:
	        file_list.append(os.path.join(dirname, subdirname))
	    for filename in filenames:
			if '.py' not in filename:
				if 'Store' not in filename:
					file_list.append(os.path.join(dirname, filename))
	return file_list

The createWordList function stores all the words in a filename in a list, converts them to lowercase, and removes punctuation:

def createWordList(line):
    wordList2 =[]
    wordList1 = line.split()
    for word in wordList1:
        cleanWord = ""
        for char in word:
            if char in '!,.?":;[]()_-#@^':
                char = ""
            cleanWord += char
        wordList2.append(cleanWord.lower())
    return wordList2

The compare function compares the files and returns their sizes as well as which one is larger:

def compare(src, dst):
	try: 
		src_size = os.path.getsize(src)
		dst_size = os.path.getsize(dst)

		if src_size >= dst_size:
			larger = 'src'
		else:
			larger = 'dst'
		return src_size, dst_size, larger
	except Exception, e:
		pass

The remove function deletes the file. If you want to implement as a dry run, comment out os.remove(f).

def remove(f):
	try:
		# comment the os.remove(f) if you want to do a dry run
		os.remove(f)
		print "DELETED: " + f
	except Exception, e:
		pass
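
Instead of commenting a line in and out, the dry run can also be a parameter. The following is a sketch for Python 3 under that assumption; `remove_file` and its `dry_run` flag are names made up here, not part of the script, and the demo file is a throwaway temp file.

```python
import os
import tempfile


def remove_file(path, dry_run=True):
    """Delete path, or only report it when dry_run is True.

    Returns True if the file was actually deleted.
    """
    if dry_run:
        print("WOULD DELETE: " + path)
        return False
    os.remove(path)
    print("DELETED: " + path)
    return True


# Demo on a temporary file so nothing real is touched.
fd, demo_path = tempfile.mkstemp(suffix=".mp3")
os.close(fd)

remove_file(demo_path)                      # dry run, file is left alone
still_there = os.path.exists(demo_path)

remove_file(demo_path, dry_run=False)       # real delete
gone = not os.path.exists(demo_path)
```

Defaulting to `dry_run=True` means a deletion only happens when it is asked for explicitly.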

The difference function calculates the absolute difference between the two file sizes and prints it. It returns the “action” variable, which later tells the script to skip a pair of files if their sizes differ by 1.5 MB or more. You can change the size threshold by altering

if difference >= 1500000:

1.5 MB = 1500000 bytes

def difference(src_size, dst_size):

	action = ''
	difference = abs(src_size - dst_size)

	# SET THE DIFFERENCE VALUE
	# SKIP RENAME IF THE TWO FILES ARE DIFFERENT BY THE FILESIZE BELOW
	# INITIAL IS SET TO 1.5 MB OR 1500000

	if difference >= 1500000:
		message =  '\tDIFFERENCE > 1.5 MB ---[ ' + str(difference) + ' ]----'
		action = 'skip'
	else:
		message =  '\tDIFFERENCE ------------[ ' + str(difference) + ' ]----'
		action = ''

	print message
	print '\tACTION: ' + action

	return action

This is the function that executes the rename or deletion for the matches in the manual intervention range.

You can pick to keep the source [src a.k.a. file 1] or the destination [dst a.k.a. file 2] or rename them.

If you select rename, you can rename the source > destination or the destination > source, effectively overwriting the target named file.

The purpose of this: say the script tells you that the dst file is larger and should be kept, but you prefer how the src file is named. You rename the dst file to the src name, which keeps the dst contents and overwrites the src file.

def execute(src, dst, source, go):
	try:
		src_size, dst_size, larger = compare(src, dst)

		if go == 'y':
			# keep the larger file and delete the smaller one
			if larger == 'src':
				print 'KEEP: ' + src
				remove(dst)
				del source[dst]
			else:
				print 'KEEP: ' + dst
				remove(src)
				del source[src]

		elif go == 's':
				print 'KEEP: ' + src
				remove(dst)
				del source[dst]

		elif go == 'd':
				print 'KEEP: ' + dst
				remove(src)
				del source[src]

		elif go == 'rename src to dst':
			print 'Renamed src to: ' + dst
			os.rename(src,dst)
			del source[src]
			del source[dst]

		elif go == 'rename dst to src':
			print 'Renamed dst to: ' + src
			os.rename(dst,src)
			del source[src]
			del source[dst]

		else:
			print 'TEST RUN: '

		print '--------------\n'
		return source
	except Exception, e:
		pass
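
The "keep one file's contents under the other file's name" case can be shown in isolation. This sketch is for Python 3 and uses os.replace, which overwrites an existing target on every platform (os.rename raises an error on Windows if the target exists). The helper name and the file names are made up for the example.

```python
import os
import tempfile


def keep_under_name(keep_path, name_path):
    """Keep the contents of keep_path, stored under the name name_path.

    Whatever was at name_path is overwritten.
    """
    os.replace(keep_path, name_path)
    return name_path


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "Nicely Named Song.mp3")   # good name, smaller file
dst = os.path.join(workdir, "nicely_named_song_2.mp3") # poor name, larger file

with open(src, "wb") as f:
    f.write(b"x" * 10)
with open(dst, "wb") as f:
    f.write(b"x" * 20)

# Keep the larger dst contents, but under the src name.
kept = keep_under_name(dst, src)
print(kept, os.path.getsize(kept))
```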

The main body of the script

# dict to hold source and chunked items
source = {}


# get all of the files in the path
# split them into words
# put them into a dict

for item in get_all_files(path):
	# assign the path + filename to src
	src = item

	# isolate file (remove the path) and strip whitespace and "mp3"
	file_split = src.split('/')
	f = file_split[-1]
	f = f.replace('.mp3', '')
	
	# turn into list of words
	# call the function to do this
	chunks = createWordList(f)

	# remove empty strings from the list
	# (build a new list; removing items while iterating skips entries)
	chunks = [word for word in chunks if len(word) > 0]

	# we store the initial path + file as the key and 
	# the word list as the value
	# in a dictionary
	source[src] = chunks



# we are going to iterate through the dict in two loops
# this loop / iteration selects 1 file from the dict

for k, v in source.items():
	# not necessary but helpful to see this is working
	print 'source item: ' + k

# this is the inner loop, which then compares the file selected above
# with every other file in the dictionary

	for a, b in source.items():
		# if the two initial and the comparison files are not the same
		# then move forward with the comparison
		if k != a:

			try: 

				# assign the path + filenames to src and dst variables
				src, dst = k, a
				
				# assign the wordlists of both files to lists
				L_1 = source[k]
				L_2 = source[a]

				# calculate the similarity ratio between the two word lists
				sm=difflib.SequenceMatcher(None,L_1,L_2)
				ratio = sm.ratio()
				
				if ratio > 0:
					print '-[LIST DIFF RATIO]-[ ' + str(ratio) +' ]----'

				# THE FOLLOWING WILL CAUSE AN AUTOMATIC REMOVAL OF A DUPLICATE
				# IF THE MATCH EXCEEDS THE FOLLOWING RATIO
				# BE CAREFUL WITH THIS

				if ratio >= 0.95:
					execute(src, dst, source, go='y')
					time.sleep(2)


				# UNCOMMENT THESE FOR MANUAL INTERVENTION
				# SET THE RANGE TO YOUR PREFERENCE OF WHAT YOU'RE LOOKING FOR

				# elif ratio > 0.90 and ratio < 0.98:
					
				# 	src_size, dst_size, larger = compare(src, dst)
				# 	print 'source: ' + src, src_size
				# 	print 'destin: ' + dst, dst_size
				# 	print '\n'
				# 	print '\tlarger: ' + larger

				# 	action = difference(src_size, dst_size)

This step is how the script skips taking action on similar files when the difference in file size is too great; the threshold is set in the difference function above.

				# 	if action == 'skip':
				# 		print 'SKIPPING TO NEXT FILE'
				# 		continue

				# 	go = raw_input('Keep [s]rc OR [d]st or [r]ename: ')

				# 	go = go.rstrip()
				# 	go = go.lower()

While not listed as an option in the above dialog, instead of picking [s] [d] or [r] as per above, the user can hit return to skip forward or type in “exit” to end the script.

				# 	if go == '':
				# 		continue

				# 	if go == 'exit':
				# 		print 'EXITING SCRIPT'
				# 		exit()

If the user selects [r] to rename, then they are asked whether to rename the source > destination or the destination > source filename.

				# 	elif go == 'r':
				# 		go_rename = raw_input('Rename [s]rc -> dst OR [d]st -> src: ')
				# 		go_rename = go_rename.strip()
				# 		go_rename = go_rename.lower()
				# 		if go_rename == 's':
				# 			go = 'rename src to dst'
				# 		elif go_rename == 'd':
				# 			go = 'rename dst to src'
				# 		elif go_rename == '':
				# 			continue
				# 		elif go_rename == 'exit':
				# 			print 'EXITING SCRIPT'
				# 			exit()

This is the function that executes the renaming and removal:

				# 	execute(src, dst, source, go)

I often like to put visual spacers / separators:

				# print '---------------\n '

			except Exception, e:
				continue
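
The two nested loops above look at every pair twice (A vs. B, then B vs. A). One alternative, sketched here for Python 3 and not part of the original script, is itertools.combinations, which yields each pair once. The paths and word lists in `source` are made-up sample data.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Same shape as the script's dict: path -> cleaned word list (sample data).
source = {
    "/music/a.mp3": ["bit", "lolitas", "murder", "weapon"],
    "/music/b.mp3": ["bit", "lolitas", "murder", "weapon"],
    "/music/c.mp3": ["other", "song"],
}

matches = []
pairs_checked = 0

# combinations() yields (a, b), (a, c), (b, c): each pair exactly once.
for (path_a, words_a), (path_b, words_b) in combinations(sorted(source.items()), 2):
    pairs_checked += 1
    ratio = SequenceMatcher(None, words_a, words_b).ratio()
    if ratio >= 0.95:
        matches.append((path_a, path_b, ratio))

print(pairs_checked, matches)
```

With three files this checks 3 pairs instead of 6, and the saving grows with the size of the library. Collecting matches in a list first also avoids changing the dict while looping over it.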

Here is the full code:

#!/usr/bin/env python

import os
import re
import sys
from difflib import SequenceMatcher as SM
import difflib
import time

'''
This was written to clean a large library of MP3 files. The files were run through 
a de-duplication utility, however it did not catch them all. 
'''

path = '/Volumes/1TB_Baby/iTunes'

def get_all_files(path):
	file_list = []
	for dirname, dirnames, filenames in os.walk(path):
	    for subdirname in dirnames:
	        file_list.append(os.path.join(dirname, subdirname))
	    for filename in filenames:
			if '.py' not in filename:
				if 'Store' not in filename:
					file_list.append(os.path.join(dirname, filename))
	return file_list


def createWordList(line):
    wordList2 =[]
    wordList1 = line.split()
    for word in wordList1:
        cleanWord = ""
        for char in word:
            if char in '!,.?":;[]()_-#@^':
                char = ""
            cleanWord += char
        wordList2.append(cleanWord.lower())
    return wordList2


def compare(src, dst):
	try: 
		src_size = os.path.getsize(src)
		dst_size = os.path.getsize(dst)

		if src_size >= dst_size:
			larger = 'src'
		else:
			larger = 'dst'
		return src_size, dst_size, larger
	except Exception, e:
		pass


def remove(f):
	try:
		os.remove(f)
		print "DELETED: " + f
	except Exception, e:
		pass


def difference(src_size, dst_size):

	action = ''
	difference = abs(src_size - dst_size)

	# SET THE DIFFERENCE VALUE
	# SKIP RENAME IF THE TWO FILES ARE DIFFERENT BY THE FILESIZE BELOW
	# INITIAL IS SET TO 1.5 MB OR 1500000

	if difference >= 1500000:
		message =  '\tDIFFERENCE > 1.5 MB ---[ ' + str(difference) + ' ]----'
		action = 'skip'
	else:
		message =  '\tDIFFERENCE ------------[ ' + str(difference) + ' ]----'
		action = ''

	print message
	print '\tACTION: ' + action

	return action



def execute(src, dst, source, go):
	try:
		src_size, dst_size, larger = compare(src, dst)



		if go == 'y':
			# keep the larger file and delete the smaller one
			if larger == 'src':
				print 'KEEP: ' + src
				remove(dst)
				del source[dst]
			else:
				print 'KEEP: ' + dst
				remove(src)
				del source[src]

		elif go == 's':
				print 'KEEP: ' + src
				remove(dst)
				del source[dst]

		elif go == 'd':
				print 'KEEP: ' + dst
				remove(src)
				del source[src]

		elif go == 'rename src to dst':
			print 'Renamed src to: ' + dst
			os.rename(src,dst)
			del source[src]
			del source[dst]

		elif go == 'rename dst to src':
			print 'Renamed dst to: ' + src
			os.rename(dst,src)
			del source[src]
			del source[dst]

		else:
			print 'TEST RUN: '

		print '--------------\n'
		return source
	except Exception, e:
		pass



# dict to hold source and chunked items
source = {}


# get all of the files in the path
# split them into words
# put them into a dict

for item in get_all_files(path):
	# assign the path + filename to src
	src = item

	# isolate file (remove the path) and strip whitespace and "mp3"
	file_split = src.split('/')
	f = file_split[-1]
	f = f.replace('.mp3', '')
	
	# turn into list of words
	# call the function to do this
	chunks = createWordList(f)

	# remove empty strings from the list
	# (build a new list; removing items while iterating skips entries)
	chunks = [word for word in chunks if len(word) > 0]

	# we store the initial path + file as the key and 
	# the word list as the value
	# in a dictionary
	source[src] = chunks



# we are going to iterate through the dict in two loops
# this loop / iteration selects 1 file from the dict

for k, v in source.items():
	# not necessary but helpful to see this is working
	print 'source item: ' + k

# this is the inner loop, which then compares the file selected above
# with every other file in the dictionary

	for a, b in source.items():
		# if the two initial and the comparison files are not the same
		# then move forward with the comparison
		if k != a:

			try: 

				# assign the path + filenames to src and dst variables
				src, dst = k, a
				
				# assign the wordlists of both files to lists
				L_1 = source[k]
				L_2 = source[a]

				# calculate the similarity ratio between the two word lists
				sm=difflib.SequenceMatcher(None,L_1,L_2)
				ratio = sm.ratio()
				
				if ratio > 0:
					print '-[LIST DIFF RATIO]-[ ' + str(ratio) +' ]----'

				# THE FOLLOWING WILL CAUSE AN AUTOMATIC REMOVAL OF A DUPLICATE
				# IF THE MATCH EXCEEDS THE FOLLOWING RATIO
				# BE CAREFUL WITH THIS

				if ratio >= 0.95:
					execute(src, dst, source, go='y')
					time.sleep(2)


				# UNCOMMENT THESE FOR MANUAL INTERVENTION
				# SET THE RANGE TO YOUR PREFERENCE OF WHAT YOU'RE LOOKING FOR

				# elif ratio > 0.90 and ratio < 0.98:

				# 	src_size, dst_size, larger = compare(src, dst)
				# 	print 'source: ' + src, src_size
				# 	print 'destin: ' + dst, dst_size
				# 	print '\n'
				# 	print '\tlarger: ' + larger

				# 	action = difference(src_size, dst_size)

				# 	if action == 'skip':
				# 		print 'SKIPPING TO NEXT FILE'
				# 		continue

				# 	go = raw_input('Keep [s]rc OR [d]st or [r]ename: ')

				# 	go = go.rstrip()
				# 	go = go.lower()

				# 	if go == '':
				# 		continue

				# 	if go == 'exit':
				# 		print 'EXITING SCRIPT'
				# 		exit()

				# 	elif go == 'r':
				# 		go_rename = raw_input('Rename [s]rc -> dst OR [d]st -> src: ')
				# 		go_rename = go_rename.strip()
				# 		go_rename = go_rename.lower()
				# 		if go_rename == 's':
				# 			go = 'rename src to dst'
				# 		elif go_rename == 'd':
				# 			go = 'rename dst to src'
				# 		elif go_rename == '':
				# 			continue
				# 		elif go_rename == 'exit':
				# 			print 'EXITING SCRIPT'
				# 			exit()


				# 	execute(src, dst, source, go)

				# print '---------------\n '

			except Exception, e:
				continue


Synchronize Files Folders and Drives with Python

I’ve been using Chronosync for years, but a recent (painful) experience with my Seagate NAS 220 led me to look elsewhere. I came across the following script on the web, which I cannot take credit for authoring. It is exceptionally useful and fast, and I’ve been using it to synchronize external drives. Note that it mirrors the source: anything in the destination that is not in the source gets deleted.

#!/usr/bin/env python
#coding=utf-8
# This is a small python script to help you synchronize two folders
# Compatible with Python 2.x and Python 3.x

import filecmp, shutil, os, sys
 
SRC = r'/path/to/source/file/or/folder'
DEST = r'/path/to/destination/file/or/folder'
 
IGNORE = ['Thumbs.db', '.DS_Store']
 
def get_cmp_paths(dir_cmp, filenames):
    return ((os.path.join(dir_cmp.left, f), os.path.join(dir_cmp.right, f)) for f in filenames)
 
def sync(dir_cmp):
    for f_left, f_right in get_cmp_paths(dir_cmp, dir_cmp.right_only):
        if os.path.isfile(f_right):
            os.remove(f_right)
        else:
            shutil.rmtree(f_right)
        print('delete %s' % f_right)  
    for f_left, f_right in get_cmp_paths(dir_cmp, dir_cmp.left_only+dir_cmp.diff_files):
        if os.path.isfile(f_left):
            shutil.copy2(f_left, f_right)
        else:
            shutil.copytree(f_left, f_right)
        print('copy %s' % f_left)
    for sub_cmp_dir in dir_cmp.subdirs.values():
        sync(sub_cmp_dir)
 
def sync_files(src, dest, ignore=IGNORE):
    if not os.path.exists(src):
        print('= =b Please check the source directory exists')
        print('- -b Sync file failure !!!')
        return
    if os.path.isfile(src):
        print('#_# We only support for sync directory but not a single file,one file please do it by yourself')
        print('- -b Sync file failure !!!')
        return
    if not os.path.exists(dest):
        os.makedirs(dest)    
    dir_cmp = filecmp.dircmp(src, dest, ignore=ignore)
    sync(dir_cmp)
    print('^_^ Sync file finished!')
 
if __name__ == '__main__':
    src, dest = SRC, DEST
    if len(sys.argv) == 3:
        src, dest = sys.argv[1:3]
    sync_files(src, dest)
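
Because the script deletes from the destination, it can be worth previewing what it would do first. The following is a sketch of a read-only "plan" pass built on the same filecmp.dircmp attributes (left_only, right_only, diff_files, subdirs). The `plan_sync` function is my own addition, not part of the script above, and the demo uses throwaway temp folders.

```python
import filecmp
import os
import tempfile


def plan_sync(src, dest, ignore=None):
    """Return (to_copy, to_delete) path lists without changing any files."""
    to_copy, to_delete = [], []

    def walk(dir_cmp):
        # new or changed on the source side: would be copied
        for name in dir_cmp.left_only + dir_cmp.diff_files:
            to_copy.append(os.path.join(dir_cmp.left, name))
        # only on the destination side: would be deleted
        for name in dir_cmp.right_only:
            to_delete.append(os.path.join(dir_cmp.right, name))
        for sub_cmp in dir_cmp.subdirs.values():
            walk(sub_cmp)

    walk(filecmp.dircmp(src, dest, ignore=ignore))
    return sorted(to_copy), sorted(to_delete)


# Demo with two temporary folders.
src_dir = tempfile.mkdtemp()
dest_dir = tempfile.mkdtemp()
with open(os.path.join(src_dir, "new.txt"), "w") as f:
    f.write("only in source")
with open(os.path.join(dest_dir, "stale.txt"), "w") as f:
    f.write("only in destination")

to_copy, to_delete = plan_sync(src_dir, dest_dir)
print("copy:", to_copy)
print("delete:", to_delete)
```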


Using Regex to Parse Text on Multiple Lines

Regular Expressions, or Regex, is a really powerful tool. Don’t quote me, but I believe it’s a full on language that’s inside all the other languages.

Regex is useful for parsing text, which is especially handy when you’re data mining.

Unfortunately regex is really intimidating to learn. I’ve never done it enough to get proficient with it — always learning just enough to accomplish something pressing. Someday, however, I’d love to go deep into regex.

For this current project I ran into 2 problems:

  1. How do I zoom into the specific text I’m searching for?
  2. How do I use regex to match across multiple lines?

Let’s say I’m trying to extract the employees (number) from the HTML below:

 

<div id="detailsContainer">
<div class="detailsDataContainerLt">
<div itemprop="url">
<a class="link_sb_blue bold" href="http://www.somesite.com" target="_blank">
www.somesite.com
</a>
</div>
<div>
<strong>
17,600
</strong>
Employees
<br/>

Here is the regex that works:

e = re.compile('class="detailsDataContainerLt".*strong>(.*?)</strong.*Employees', re.DOTALL)
emp = e.findall(str(soup))

First, we create the regex pattern, which is the re.compile part above. The trick to isolating the text I want to extract is the .* wildcard, which matches any run of characters. What I’m saying here is: start with class="detailsDataContainerLt", then match everything from there through strong>.

The text I want to grab comes next, which I capture with (.*?). The capture stops at </strong, and the rest of the pattern matches everything after that up to Employees.

This approach accomplishes #1 on my list above.

Now to make this work across multiple lines, I have to include the re.DOTALL.

The next line then executes the search:

emp = e.findall(str(soup))

Which I can then output via

print emp

And there you have it!
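
For reference, here is a self-contained version that runs on Python 3 against the HTML snippet above. It is a variation rather than the exact pattern from this post: it uses a non-greedy .*? before the strong tag and \s* before Employees (my own changes), and it strips the whitespace around the captured number. It also shows that the same pattern finds nothing without re.DOTALL.

```python
import re

html = """<div id="detailsContainer">
<div class="detailsDataContainerLt">
<div itemprop="url">
<a class="link_sb_blue bold" href="http://www.somesite.com" target="_blank">
www.somesite.com
</a>
</div>
<div>
<strong>
17,600
</strong>
Employees
<br/>
"""

PATTERN = r'class="detailsDataContainerLt".*?<strong>(.*?)</strong>\s*Employees'

# With re.DOTALL the dot also matches newlines, so the pattern can span lines.
match = re.search(PATTERN, html, re.DOTALL)
employees = match.group(1).strip() if match else None
print(employees)

# Without re.DOTALL the dot stops at each newline and nothing is found.
without_dotall = re.search(PATTERN, html)
print(without_dotall)
```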


Python – Append write to a file

I was recently doing some data mining and came across this tip to append to an existing file.

In this first section I test to see if the file exists (via the “if os.path.isfile(file_out_people) != True:” line). If it doesn’t exist, then I create the file and write the header string. This section can be used at the top of a script, to create the file in which you’ll write more content to later.

#!/usr/bin/env python

import os
import sys
PP_header = "company|ticker|person_id|name|title|url\n"

# test to see if file exists, if not then create it
file_out_people = path + '/' + company + '-people.txt'
if os.path.isfile(file_out_people) != True:
    with open(file_out_people, "w") as output:
        output.write(PP_header)

In this block, you load the existing file and read it line by line. Then you write to an output file with the same name as the input file, first writing out the lines you read in. Then you write the new line to the end.

with open(file_out_people) as input:
# Read non-empty lines from input file
    lines = [line for line in input if line.strip()]

with open(file_out_people, "w") as output:
    for line in lines:
        output.write(line)
    output.write(company_write + '|' + ticker + '|' + str(person_id[x]) + '|' + name[x] + '|' + exec_title[x] + '|' + person_url + '\n')

Pros: The with statement closes the file automatically, so you don’t have to call output.close(). It’s also great for incrementally adding content to a file. If you’re running a big script with a single final write at the end and anything causes it to fail, you’ll end up with a blank output file. This method, however, writes as you go.

Con: It writes as you go! For each iteration it reads in the existing file, writes it all out again, and then adds the new data to the end. In some ways this is inefficient.

It’s probably better to store all the output in a list, and then upon end or failure, to dump that to a file one time.

That said, this is handy and does what I need it to do right now.
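
For comparison, Python's built-in append mode (opening the file with "a") adds to the end of a file without reading it back in first. The following is a sketch for Python 3, not what my script does: the `append_row` helper, the temp file path, and the sample rows are made up for the example.

```python
import os
import tempfile

HEADER = "company|ticker|person_id|name|title|url\n"


def append_row(path, fields):
    """Append one pipe-delimited row, writing the header first if the file is new."""
    is_new = not os.path.isfile(path)
    # "a" opens for appending: existing content is left untouched
    with open(path, "a") as output:
        if is_new:
            output.write(HEADER)
        output.write("|".join(str(field) for field in fields) + "\n")


demo_path = os.path.join(tempfile.mkdtemp(), "acme-people.txt")
append_row(demo_path, ["Acme", "ACME", 1, "Jane Doe", "CEO", "http://example.com/1"])
append_row(demo_path, ["Acme", "ACME", 2, "John Roe", "CFO", "http://example.com/2"])

with open(demo_path) as f:
    lines = f.read().splitlines()
print(lines)
```

This keeps the "writes as you go" benefit without rewriting the whole file on every iteration.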
