Latent Semantic Analysis in Haskell

Last weekend I decided to learn Haskell with Francis Williams. It was an epic adventure. Last night I implemented LSA – Latent Semantic Analysis using Haskell.

For those of you who are not enlightened: LSA takes a list of documents, extracts terms from them, builds a tf-idf matrix, then compresses the matrix to build a lower dimensional approximation, to improve the speed and accuracy of measuring the similarity between documents. Sometimes, this is used in combination with clustering to group the documents into categories.

I want to point anyone interested in this towards the following resources that helped me do this:

In case anyone wants to do something similar themselves, here is the code I ended up with. Please let me know if I’m doing something stupid and I’ll fix this up. I suspect that I am, considering how new I am to Haskell.

import Data.List
import Data.Char
import Numeric.LinearAlgebra

titles = ["The Neatest Little Guide to Stock Market Investing",
          "Investing For Dummies, 4th Edition",
          "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
          "The Little Book of Value Investing",
          "Value Investing: From Graham to Buffett and Beyond",
          "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
          "Investing in Real Estate, 5th Edition",
          "Stock Investing For Dummies",
          "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss" ]

stopwords = ["and","edition","for","in","little","of","the","to"]

text = unlines titles

docs :: [[String]]
docs = map (filter (not . (`elem` stopwords))) $      -- stopwords filter
        map words $
          lines $
            filter (\x -> isAlpha x || isSpace x) $   -- discard everything except alpha and space characters
              map toLower text                        -- lowercase the input

tf :: [([Char], Int)]
tf = filter (\(_,f) -> f>1) $ map (\l@(x:xs) -> (x,length l)) . group . sort $ concat docs -- remove words that appear only once

doc_freq :: Int -> [Char] -> Int
doc_freq d t = length (filter (==t) (docs !! d))

mat :: Matrix Double
mat = buildMatrix (length tf) (length docs) ( \(term, doc) ->
          let occurances = fromIntegral $ doc_freq doc $ fst $ tf !! term -- occurance count
              docLength = genericLength $ docs !! doc                     -- words per doc
              numDocs = genericLength docs                                -- number of docs
              commonness = fromIntegral $ snd $ tf !! term                -- number of docs this word occurs in
          in (occurances / docLength * log (numDocs / commonness))
      )

compress k m = u_k  sigma_k  v_k where
    (u,sigma,v) = fullSVD m                         -- get SVD
    sigma_k = (takeColumns k . takeRows k) sigma    -- keep k values of Σ
    u_k = takeColumns k u                           -- keep k columns of U
    v_k = takeRows k $ trans v                      -- keep k rows of v

reduce_dim k m = v_k where
        (u,sigma,v) = fullSVD m                         -- mapping of documents to concept space
        v_k = takeRows k $ trans v                      -- keep k rows of v