Listing all the unique words in a piece of text
This lesson demonstrates how to list each unique word in a piece of text.
The uniqueWords function
The uniqueWords function takes one parameter, pString. The function uses a repeat loop to check each word in turn, building an array variable named tWordsList. Each element of tWordsList is associated with a different word: the element's key is the word, and the element's content is a number counting how many times that word has been seen so far. For example, if the first word of the string is "Cans", then after the first word is processed, the array tWordsList contains one element, named "Cans", which contains the number 1.
When a word is processed, the handler adds 1 to the element corresponding to that word. If there is no array element with that name already, one is created automatically by the add command. In general, changing a variable, a chunk in a variable, or an element in an array variable creates the variable, chunk, or element automatically, if it doesn't already exist. If there is already an element with that name, that is, if the word already exists in the array, 1 is added to that existing element.
After all the words have been processed, the function exits the repeat loop. At this point, the array variable tWordsList contains an element for each unique word, whose name is the word itself. The keys of tWordsList are therefore a list of all the unique words in the string, one word per line.
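The automatic creation of array elements described above can be tried in isolation. The following is a minimal sketch, assuming it is placed in a button script; the variable name tCounts and the sample keys are made up for illustration.

```livecode
on mouseUp
   local tCounts
   -- tCounts starts out empty, so no element named "Cans" exists yet
   add 1 to tCounts["Cans"]   -- the element is created and now holds 1
   add 1 to tCounts["Cans"]   -- the element already exists, so it becomes 2
   add 1 to tCounts["of"]     -- a second element is created, holding 1
   -- with no destination, put writes to the message box: 2 1
   put tCounts["Cans"] && tCounts["of"]
end mouseUp
```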
LiveCode chunk expressions
This form of word-by-word processing is possible because LiveCode uses chunk expressions to manage text. A chunk expression is a way of describing a specific portion of a container. LiveCode can directly address individual words, characters, lines, and items (delimited by any character).
In this example, we use the repeat for each chunk form of the repeat control structure:
repeat for each word tWord in pString
This repeat structure loops through each word in the parameter pString, putting the current word into a variable called tWord. You can also loop through other chunk types in a repeat structure, processing each character, line, or item.
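The same loop form works for the other chunk types. As a rough sketch, the two hypothetical functions below (sumItems and countVowels are not part of the lesson) loop over items and characters respectively, assuming the default comma item delimiter.

```livecode
-- Adds up the numeric items in a comma-delimited list
function sumItems pList
   local tTotal, tItem
   put 0 into tTotal
   repeat for each item tItem in pList
      if tItem is a number then add tItem to tTotal
   end repeat
   return tTotal
end sumItems

-- Counts the vowels in a string, one character at a time
function countVowels pString
   local tCount, tChar
   put 0 into tCount
   repeat for each char tChar in pString
      if tChar is in "aeiou" then add 1 to tCount
   end repeat
   return tCount
end countVowels
```

For example, sumItems("10,20,30") returns 60 and countVowels("LiveCode") returns 4.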
The uniqueWords function code
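function uniqueWords pString
   local tWordsList
   -- count each word; the array key is the word itself
   repeat for each word tWord in pString
      add 1 to tWordsList[tWord]
   end repeat
   -- the keys are the unique words, one per line
   return the keys of tWordsList
end uniqueWords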
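One possible way to call the function is sketched below, assuming uniqueWords is available in the same script or further along the message path. The sample sentence is arbitrary. The order of array keys is not guaranteed, so the result is sorted before it is displayed.

```livecode
on mouseUp
   local tUnique
   put uniqueWords("the cat and the hat") into tUnique
   -- keys come back one per line in no particular order
   sort lines of tUnique
   -- shows four lines in the message box: and, cat, hat, the
   put tUnique
end mouseUp
```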
A note on efficiency
This example uses the repeat for each word form of the repeat control structure. When looping over chunk types in a string, this form is the fastest, because it walks through the string once. The following repeat structure is functionally equivalent to the one in this example, but is much slower, because each reference to word x of pString has to count words from the start of the string again:
repeat with x = 1 to the number of words in pString
   add 1 to tWordsList[word x of pString]
end repeat
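To see the difference on your own machine, a rough timing sketch along the following lines could be used. The sample text, its size, and the variable names are all assumptions made for illustration, and the timings will vary with the text and the LiveCode version.

```livecode
on mouseUp
   local tText, tStart, tFast, tSlow, tCountsA, tCountsB, tWord, x
   -- build some sample text; the size is arbitrary
   repeat 20000 times
      put "word" & random(500) & space after tText
   end repeat

   -- time the repeat for each form
   put the milliseconds into tStart
   repeat for each word tWord in tText
      add 1 to tCountsA[tWord]
   end repeat
   put the milliseconds - tStart into tFast

   -- time the repeat with form
   put the milliseconds into tStart
   repeat with x = 1 to the number of words in tText
      add 1 to tCountsB[word x of tText]
   end repeat
   put the milliseconds - tStart into tSlow

   put "for each:" && tFast && "ms; with x:" && tSlow && "ms"
end mouseUp
```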
A note on trueWord
You may notice that the code as written returns strings of text with punctuation in them. In LiveCode, a "word" is any run of characters delimited by spaces, tabs, or returns, so trailing punctuation stays attached to the word: uniqueWords("a, b") returns "a," and "b". As of LiveCode 7, one can instead use the trueWord keyword, which is more discerning in that regard.
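A trueWord variant of the function might look like the sketch below. The name uniqueTrueWords is made up for illustration, and the code assumes LiveCode 7 or later; it will not compile in earlier versions.

```livecode
function uniqueTrueWords pString
   local tWordsList, tWord
   -- trueWord chunks follow natural-language word boundaries,
   -- so punctuation is not kept as part of the word
   repeat for each trueWord tWord in pString
      add 1 to tWordsList[tWord]
   end repeat
   return the keys of tWordsList
end uniqueTrueWords
```

With this version, uniqueTrueWords("a, b") returns "a" and "b" rather than "a," and "b".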
Richard M Kriesel
"A note on efficiency" above has a bug at "tWordsList" which should be "word" instead.
Richard M Kriesel
Since uniqueWords("a, b") returns words "a," and "b" this lesson ought to recognize that "word" here refers to a LiveCode word, and that "trueword" may better serve the user's needs.
Panos Merakos
Hello Richard,
Thank you for your comments.
We have now fixed the error in the code in "A note on efficiency". Thanks for spotting it.
RE the use of "trueWord" instead of "word": yes, you are correct. However, "trueWord" was introduced in LC 7, and since there are a (few) users who still use LC < 7, I think I will not update the example script and the sample stack, but I will add a note to the lesson.