Latent Semantic Analysis in Haskell

Last weekend I decided to learn Haskell with Francis Williams. It was an epic adventure. Last night I implemented LSA – Latent Semantic Analysis using Haskell.

For those of you who are not enlightened: LSA takes a list of documents, extracts terms from them, builds a tf-idf matrix, then compresses the matrix to build a lower dimensional approximation, to improve the speed and accuracy of measuring the similarity between documents. Sometimes, this is used in combination with clustering to group the documents into categories.

I want to point anyone interested in this towards the following resources that helped me do this:

In case anyone wants to do something similar themselves, here is the code I ended up with. Please let me know if I’m doing something stupid and I’ll fix this up. I suspect that I am, considering how new I am to Haskell.

import Data.List
import Data.Char
import Numeric.LinearAlgebra

titles = ["The Neatest Little Guide to Stock Market Investing",
"Investing For Dummies, 4th Edition",
"The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
"The Little Book of Value Investing",
"Value Investing: From Graham to Buffett and Beyond",
"Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
"Investing in Real Estate, 5th Edition",
"Stock Investing For Dummies",
"Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss" ]

stopwords = ["and","edition","for","in","little","of","the","to"]

text = unlines titles

docs :: [[String]]
docs = map (filter (not . (`elem` stopwords))) \$      -- stopwords filter
map words \$
lines \$
filter (\x -> isAlpha x || isSpace x) \$   -- discard everything except alpha and space characters
map toLower text                        -- lowercase the input

tf :: [([Char], Int)]
tf = filter (\(_,f) -> f>1) \$ map (\l@(x:xs) -> (x,length l)) . group . sort \$ concat docs -- remove words that appear only once

doc_freq :: Int -> [Char] -> Int
doc_freq d t = length (filter (==t) (docs !! d))

mat :: Matrix Double
mat = buildMatrix (length tf) (length docs) ( \(term, doc) ->
let occurances = fromIntegral \$ doc_freq doc \$ fst \$ tf !! term -- occurance count
docLength = genericLength \$ docs !! doc                     -- words per doc
numDocs = genericLength docs                                -- number of docs
commonness = fromIntegral \$ snd \$ tf !! term                -- number of docs this word occurs in
in (occurances / docLength * log (numDocs / commonness))
)

compress k m = u_k  sigma_k  v_k where
(u,sigma,v) = fullSVD m                         -- get SVD
sigma_k = (takeColumns k . takeRows k) sigma    -- keep k values of Σ
u_k = takeColumns k u                           -- keep k columns of U
v_k = takeRows k \$ trans v                      -- keep k rows of v

reduce_dim k m = v_k where
(u,sigma,v) = fullSVD m                         -- mapping of documents to concept space
v_k = takeRows k \$ trans v                      -- keep k rows of v