I want you to find a poem you love.
This project comes from a very human desire: to share something I find meaningful. Poems are hard to love, but when one makes it into your soul it can be the most powerful thing. How do we find poems to love?
This is my idea: you might have one or two poems you really love. You found them at some point in your life and they just clicked. If I could give you one more poem you love, you could sit with it and take it into your heart and love it. One more poem! Amazing. And then, maybe, one more poem you love. Slowly you build a larger library of poems. Eventually it becomes easier to love new ones.
The Poetry Recommendation Engine takes one poem and tries to give you one other poem that can touch you. You can take it from there.
This is how the engine works.
The Poetry Recommendation Engine currently looks at a set of computer-generated features that describe your poem and finds another poem in the database with the most similar set of features. "Most similar" means the poem with the smallest sum of squared differences across all features.
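Here is a minimal Python sketch of that matching step. The feature names and the two-poem "database" are made up for illustration, and the sketch assumes the features are compared as raw numbers; whether the real engine rescales them first isn't stated here.

```python
def squared_distance(a, b):
    """Sum of squared differences between two feature dicts with the same keys."""
    return sum((a[name] - b[name]) ** 2 for name in a)


def recommend(query, database):
    """Return the title of the poem whose features are closest to `query`."""
    return min(database, key=lambda title: squared_distance(query, database[title]))


# Hypothetical feature values, for illustration only.
database = {
    "Poem A": {"num_words": 120, "num_lines": 14, "avg_word_size": 4.2},
    "Poem B": {"num_words": 480, "num_lines": 60, "avg_word_size": 4.9},
}
query = {"num_words": 130, "num_lines": 16, "avg_word_size": 4.0}
print(recommend(query, database))
```

One thing this makes visible: with raw values, a feature with large numbers (like word count) contributes far more to the distance than a feature with small numbers (like average word size).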
These features describe the look of the poem on the page. They are easy to calculate so I started with these.
Number of Words: Pretty self-explanatory.
Number of Lines: Yup.
Width in Characters: This is the average line length in characters. This gives a sense of how wide the poem sits on the page.
Average Word Size: This is the average size of a word in characters. The idea here is to get a feel for the types of words in the poem -- lots of long words or mostly short ones? Technically this could count as a vocabulary feature but it's calculated like a size one.
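These four size features could be computed roughly like this. It's a sketch under a few assumptions that aren't spelled out above: blank lines are skipped, words are whatever is separated by whitespace, and punctuation attached to a word counts toward its length. The function name and the sample text are made up.

```python
def size_features(poem):
    """Compute the four size features for a poem given as a plain string."""
    # Assumption: blank lines (stanza breaks) are not counted as lines.
    lines = [line for line in poem.splitlines() if line.strip()]
    # Assumption: a word is any whitespace-separated token.
    words = poem.split()
    return {
        "num_words": len(words),
        "num_lines": len(lines),
        "width_in_chars": sum(len(line) for line in lines) / len(lines) if lines else 0.0,
        "avg_word_size": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


sample = "the sea is wide\nand I am small"
print(size_features(sample))
```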
These features try to describe the vocabulary the poem uses.
Repetition Score: A score of 0 means no word or word stem is repeated. A score of 1 means every single word is exactly the same. It's calculated as follows: each word in the poem contributes 1 if it is repeated exactly elsewhere in the poem, 0.5 if only its stem (using the Snowball Stemmer) is repeated elsewhere, or 0 if neither. Then the total is divided by the number of words in the poem.
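A sketch of that calculation in Python. To keep the example free of dependencies, `toy_stem` is a crude made-up stand-in for the Snowball Stemmer, so its stems will differ from the real ones; the function names are also just for illustration.

```python
from collections import Counter


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def repetition_score(words, stem=toy_stem):
    """1 per exactly repeated word, 0.5 per word whose stem repeats, over word count."""
    if not words:
        return 0.0
    word_counts = Counter(words)
    stem_counts = Counter(stem(w) for w in words)
    total = 0.0
    for w in words:
        if word_counts[w] > 1:
            total += 1.0  # the exact word appears elsewhere
        elif stem_counts[stem(w)] > 1:
            total += 0.5  # only the stem appears elsewhere
    return total / len(words)


print(repetition_score(["rain", "rain", "rains", "falls"]))
```

If NLTK is installed, `SnowballStemmer("english").stem` from `nltk.stem.snowball` can be passed as the `stem` argument in place of the toy version.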
Obscurity Score: A score of 0 means every word is found in the top 1000 used English words according to WordFrequency.Info. Each word not found in the top 1000 contributes 1; the score is then divided by the number of words in the poem.
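In code, this is just the fraction of words that fall outside the common-word list. The sketch below uses a tiny made-up set in place of the real top-1000 list, and assumes the words have already been lowercased.

```python
def obscurity_score(words, common_words):
    """Fraction of words that are not in the set of common words."""
    if not words:
        return 0.0
    return sum(1 for w in words if w not in common_words) / len(words)


# Tiny hypothetical stand-in for the top-1000 list.
common_words = {"the", "sea", "is", "and", "i", "am"}
words = "the sea is vast and i am infinitesimal".split()
print(obscurity_score(words, common_words))  # 2 of the 8 words are not in the set
```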
These features try to describe some aspects of the linguistics a poem uses.
Sentence Score: +1 for each sentence used, where a sentence is a string of text ending with a period. +1 for each sentence starting with a capital letter. +1 for each sentence that ends with a line break. Divide the score by the number of characters in the poem, then multiply by ten to get larger numbers.
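One possible reading of those rules, sketched in Python. The description leaves some details open, so this assumes that a sentence is any run of text up to and including a period, that leading whitespace is ignored when checking for a capital letter, and that "ends with a line break" means the period is followed directly by a newline or by the end of the poem.

```python
import re


def sentence_score(poem):
    """Score how 'sentence-like' a poem is, per the three +1 rules."""
    if not poem:
        return 0.0
    score = 0
    for match in re.finditer(r"[^.]+\.", poem):
        sentence = match.group().lstrip()
        score += 1  # a run of text ending with a period
        if sentence[:1].isupper():
            score += 1  # starts with a capital letter
        rest = poem[match.end():]
        if rest == "" or rest.startswith("\n"):
            score += 1  # the period is the last thing on its line
    return 10 * score / len(poem)


print(sentence_score("The sea is wide.\nit is. Small"))
```

In the sample, the first sentence earns all three points and the second earns one, so the score is 10 * 4 divided by the 29 characters in the text.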
Previous incarnations to haunt me.
What's Wrong With Topic Modeling?
My first iteration of this project used topic modeling, specifically an algorithm called latent Dirichlet allocation (LDA). My favorite layman's explanation of LDA is by Edwin Chen on his blog. My favorite technical explanation of LDA is by Prof. David Blei in this lecture. Basically, LDA turns each poem into a bag of words: 12 cases of 'you', 3 cases of 'flower', etc. Then it finds clusters of words that commonly occur together in poems. These are considered the topics. Each poem has a distribution over all topics: a poem might be mostly topic 1, partly topic 2, and very little topic 3. In this application there were 20 topics (an arbitrary choice I made). I would recommend a poem with similar topics. I tried a couple of different definitions of 'similar', none of which ever produced very good or interpretable results.
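The bag-of-words step is easy to picture in code. This sketch covers only that step, not LDA itself. It assumes the text is lowercased and stripped of ASCII punctuation before counting; the function name is made up.

```python
import string
from collections import Counter


def bag_of_words(poem):
    """Lowercase the poem, drop ASCII punctuation, and count each word."""
    cleaned = poem.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


bag = bag_of_words("You, you and the flower. Don't!")
print(bag)
```

Word order is thrown away entirely: the model only ever sees how many times each word appears in each poem.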
I didn't think topic modeling worked very well in this application. When I looked at the resulting topics, which I've left below for the curious, some seem to make sense, but most seem random. I set alpha to 0.1 and document topic distributions are fairly pointed, i.e. the distribution is not very even, which is good. But the topic words don't make a lot of sense.
Even if the topics had made sense, the average number of words from the top 20 words in the top topic of a poem was 2.5 words per poem, or only 1% of all words in the poem! It does not dramatically improve if I look at the top 3 topics: that gets me to 8.5 words, or 4%. I tried generating far more topics; that didn't help either. This implies that the poems aren't strongly correlated with their top topics.
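For a single poem, that coverage check could look something like the sketch below. It assumes every occurrence of a topic word counts (rather than each distinct word once), which isn't specified above; the function name and the sample poem are made up.

```python
def top_topic_coverage(poem_words, top_topic_words):
    """Return (count, fraction) of poem words that are among the topic's top words."""
    top = set(top_topic_words)
    hits = sum(1 for w in poem_words if w in top)
    fraction = hits / len(poem_words) if poem_words else 0.0
    return hits, fraction


poem_words = "the sea and the salt wind on the long shore".split()
topic_words = ["sea", "water", "fish", "waves", "shore", "salt"]
print(top_topic_coverage(poem_words, topic_words))
```

Averaging the count and the fraction over every poem in the corpus would give figures comparable to the ones quoted above.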
Some theories for this:
- My corpus isn't large enough. 5,000 poems is really not that many, especially given that many poems are short.
- Poems don't have enough words in them to strongly correlate with their topics. (Compared to, say, newspaper articles.)
- Topic modeling doesn't work well for poetry because poems don't repeat their topical words. The vocabulary for poetry is far larger than the vocabulary for e.g. newspapers. (I should get some hard numbers on this eventually.)
- The topics of poems are not reflected in their vocabulary. A poem about heartbreak could use flowers as a metaphor. How do we know the poem is about heartbreak? (I'm least convinced by this.)
The other problem with topic modeling is that I did not get the same topics each time. If I ran the topic modeling 20 times, I'd get 20 different recommendations. While this has some benefits, it meant I couldn't really dig in and improve it, because the results changed so drastically with each run.
What are these crazy topics that are generated? Great question. Below is the list of topics and the top 20 words in each topic. Note that all poems are "cleaned", which means I make everything lowercase and remove punctuation.
- poet, poetry, poem, rachel, words, write, joe, poems, read, book, name, page, does, writing, language, mind, end, these, line, two,
- god, let, death, dead, earth, world, lord, blood, men, soul, shall, name, die, whose, heart, great, father, live, must, children,
- body, face, hands, mouth, hand, head, against, blood, hair, black, behind, between, dead, flesh, fingers, skin, white, open, tongue, dark,
- thy, thou, shall, thee, yet, may, fair, did, nor, sweet, heart, let, o, whose, should, ye, art, away, dear, such,
- im, dont, get, just, got, want, think, right, tell, thats, hes, good, cant, says, ive, youre, look, ill, well, oh,
- america, black, white, war, new, used, la, san, american, states, river, people, great, company, radio, york, between, walt, money, indian,
- sea, water, fish, under, waves, ship, sand, shore, ocean, island, dream, salt, beach, bay, ships, white, wave, islands, waters, whose,
- —, men, king, oer, lord, war, sword, gold, nor, though, many, hall, folk, battle, hero, far, land, fell, fight, son,
- –, such, these, own, nor, yet, good, men, though, must, made, may, both, should, upon, make, mind, nature, against, set,
- pink, off, two, dog, blood, cut, ground, mud, fish, little, horse, hand, shot, water, fence, red, feet, eye, under, put,
- beyond, space, mind, dream, begins, eye, gold, sky, form, among, yes, clouds, sun, beatrice, inside, force, star, terrific, electric, planet,
- city, street, 2, 1, black, 3, town, dreams, past, 4, streets, face, window, train, new, floor, boys, toward, 5, men,
- room, white, little, two, water, house, blue, glass, door, around, black, table, off, mother, paper, woman, big, before, hands, red,
- o, de, gives, b, miss, s, r, c, y, mrs, boom, naomi, e, london, thru, boomlay, blessed, caw, lady, black,
- those, first, may, english, among, country, whose, reader, well, nor, fields, heaven, wine, turned, today, taking, various, ground, use, hate,
- nothing, other, even, been, just, something, always, without, think, must, because, after, though, things, much, being, years, another, why, first,
- while, air, heart, o, far, upon, deep, world, voice, little, round, wild, above, soul, stars, song, beneath, sleep, dark, hear,
- im, god, after, yr, dis, these, find, mark, sun, dem, broken, inside, again, gone, death, good, win, almost, white, rest,
- sun, wind, trees, sky, leaves, green, white, dark, rain, air, blue, tree, under, river, moon, cold, red, snow, grass, water,
- came, did, saw, went, took, knew, thought, made, after, years, mother, last, once, heard, turned, away, left, looked, father, stood,
At first I didn't remove any stop words, which caused the obvious problem of the word 'the' being ranked highly in most topics. I've slowly been removing more and more stop words to try to get better topic formation. At this point I'm removing the 100 most frequently occurring words in the corpus. They are listed below.
the and of a to in i that is with it you my on for as his he from was not but her like we all at me or they this be by are your no their one when its so what she have out there who an if our will into up where now then him were them had down which how love us through do can would more over back time see night some know could man light only still has day come here eyes way long old each life never about too say am go said than
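A corpus-derived stop list like this one can be built by counting words across all poems and dropping the most frequent. Here is a minimal sketch, assuming each poem is already a list of cleaned words; the function names and the two toy poems are made up, and the example removes the top 2 words instead of the top 100.

```python
from collections import Counter


def top_n_words(poems, n):
    """Return the n most frequent words across all poems as a set."""
    counts = Counter(word for poem in poems for word in poem)
    return {word for word, _ in counts.most_common(n)}


def remove_stop_words(poems, stop_words):
    """Drop every stop word from every poem, keeping word order."""
    return [[w for w in poem if w not in stop_words] for poem in poems]


poems = [
    ["the", "sea", "the", "salt"],
    ["the", "moon", "and", "the", "sea"],
]
stop_words = top_n_words(poems, 2)
print(remove_stop_words(poems, stop_words))
```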
Who am I.
Hi! My name is Katy and I became obsessed with the idea of a poetry recommendation engine in early 2016. I'm a writer-slash-engineer and what I like to do is make things. I used to make physical things (undergraduate degree in mechanical engineering) and now I make computer things (data scientist) but I've always made things with words. I blog about things I make here and about the books I read here.
A work in progress.
There are lots of things still to do! This project is still in its early stages.
- Get some features that describe rhyme! This will be hard but fun.
- I have some copyright information about the poems which I can parse to look for dates of publication and use that as a feature.
- I have tags from Poetry Foundation; these can also be used as features.
- Generate features that describe line length variability, white space usage, and punctuation usage.
- There are some online word catalogues that could allow me to create scores for concreteness, emotion, etc. based purely on the vocabulary.
Last updated July, 2017.