The easy way to find the manufacturer is to use the word() function. It takes two arguments: the first is the string you want to extract a word from and the second is the number of the word you want. Since the manufacturer is always the first word of make, all you need is:
gen manufacturer=word(make,1)
You do not need to set a variable type, despite the instruction to use the smallest possible variable type: Stata automatically makes new strings as small as possible.
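If you want to see what word() returns before applying it to the data, you can try it on literal strings with display. The following is a minimal sketch meant to be run from a do-file; the literal strings are just illustrations, and the negative index (which counts words from the end) is an extra not used elsewhere in this solution.

```stata
* Try word() on literal strings first
display word("AMC Concord", 1)        // first word
display word("AMC Concord", 2)        // second word
display word("Chev. Monte Carlo", -1) // negative numbers count from the end

* Then apply it to the auto data and check the storage type Stata chose
sysuse auto, clear
gen manufacturer = word(make, 1)
describe manufacturer
```

The describe output shows the string type Stata picked for manufacturer without your having specified one.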
To turn manufacturer into a numeric categorical variable, use the group() function in the egen library. It takes a single argument, which can be a variable or variable list, and assigns group numbers based on that variable.
egen byte manCat=group(manufacturer)
There are just 23 categories, and byte, the smallest variable type of all, can hold integers from -127 to 100. That makes it the ideal choice for manCat. If you didn't know how many categories you had, you might have to create the variable, find out what its maximum value is, and then drop it and recreate it using the proper type. But it's rare for categorical variables to need anything bigger than byte.
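If you don't know the number of categories in advance, one possible shortcut (not part of the original exercise) is to let Stata choose a type, check the maximum, and then let compress shrink the variable to the smallest type that fits. A sketch, assuming the auto data:

```stata
sysuse auto, clear
gen manufacturer = word(make, 1)

* Create the group variable without specifying a type
egen manCat = group(manufacturer)

* summarize leaves the largest group number in r(max)
summarize manCat
display r(max)

* compress demotes manCat to the smallest type that can hold its values
compress manCat
describe manCat
```

This avoids the drop-and-recreate step, at the cost of briefly storing the variable in a larger type than it needs.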
To see what you've done so far, do a list:
l make manufacturer manCat
You'll see that manCat jumps from 1 to 4. That's because group numbers are assigned based on the sort order of the group variable, not the current sort order of the data. Thus AMC is manufacturer #1, Audi #2 and BMW #3, but Audi and BMW are down with the foreign cars. To see them in order, type:
sort manufacturer
and then repeat the list.
Once you've got a numeric categorical variable, obtaining a set of dummies is easy with factor notation. It also lets you avoid creating new variables to store them: the most memory efficient variable of all is the variable you don't create.
l manCat i.manCat
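Factor notation is most often used in estimation commands rather than in list. As a rough illustration (the choice of regress and of price as the outcome is an assumption made for the example, not part of the exercise), you could use the manufacturer dummies in a model without ever creating them:

```stata
sysuse auto, clear
gen manufacturer = word(make, 1)
egen byte manCat = group(manufacturer)

* i.manCat expands into one indicator per manufacturer,
* with one category omitted as the base level
regress price i.manCat
```

No new variables are added to the data set; Stata builds the indicators on the fly.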
There are many ways to extract manufacturer from make, and it's good to know several of them. Another option is the oddly-named egen function ends(), with the head option. This will give you the first word of the string:
egen manufacturer2=ends(make), head
The last option would give you the last word, and the tail option would give you all but the first word. But the advantage of ends() over word() is that ends() has a punct() option which lets you divide strings into "words" based on characters other than spaces. Thus if you had a variable fullname containing Dimond,Russell you could do:
egen firstname=ends(fullname), punct(",") last
egen lastname=ends(fullname), punct(",") head
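To try this without a real data set, you can type in a couple of names by hand. This is only a sketch: the variable fullname is the hypothetical one described above, and the second name is made up for illustration.

```stata
clear
input str20 fullname
"Dimond,Russell"
"Doe,Jane"
end

* Split on the comma: the last piece is the first name,
* the head piece is the last name
egen firstname = ends(fullname), punct(",") last
egen lastname  = ends(fullname), punct(",") head
list
```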
The most flexible method uses substr(), but, as usual, flexibility implies complexity. substr() takes three arguments: the string you want to extract a substring from, the location where the substring should start, and the number of characters it should contain. Since we want the first part of make, the starting location is just 1. The trick is that the length of each manufacturer is different. However, we know we've hit the end of the manufacturer when we see a space, and we can use the strpos() function to find the space. strpos() takes two strings as arguments and returns the location of the second string within the first, or zero if the second string is not in the first (which can also be useful). Thus to find manufacturer using substr() you would type:
gen manufacturer3=substr(make,1,strpos(make," "))
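This gives you one missing value: the car whose make is just Subaru. word() and ends() treated Subaru as the first word, but our substr() method fails on it: strpos(make," ") returns zero because Subaru doesn't contain a space, and when substr() is asked for a substring of zero length it returns an empty string, which Stata treats as missing. Note also that for the other cars the result ends with the space itself, because the length passed to substr() is the position of the space. Use strpos(make," ")-1 as the length, or wrap the result in trim(), if you don't want it.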
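One way to make the substr() approach handle one-word makes is to check the result of strpos() with cond() and fall back to the whole string when no space is found. This is a sketch rather than the method the exercise asks for; the names spacePos and manufacturer4 are made up for the example.

```stata
sysuse auto, clear

* Position of the first space, or 0 if make has no space
gen spacePos = strpos(make, " ")

* If there is a space, take everything before it;
* otherwise the whole string is the manufacturer
gen manufacturer4 = cond(spacePos > 0, substr(make, 1, spacePos - 1), make)

* Count any observations where this disagrees with word()
count if manufacturer4 != word(make, 1)
```

The extra variable and the cond() call are exactly the kind of complexity that word() spares you.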
The moral of this story: use word() if you can, because it's so easy. But if you need to extract data from a complex piece of text (say, the HTML source code of a web page) substr() and strpos() may be your only hope.
Last Revised: 11/20/2009
