Stata Training 1

8/3/2019 Stata Training 1


Introduction to Stata

Fitsum Zewdu

J. Research Fellow

EEPRI



The Stata Interface

Windows

The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Windows pull-down menu or the buttons on the tool bar.

These are the Stata windows:

Stata Results          To see recent commands and output
Stata Command          To enter a command
Stata Browser          To view the data file (needs to be opened)
Stata Editor           To edit the data file (needs to be opened)
Stata Viewer           To get help on how to use Stata
Variables              To see a list of variables
Review                 To see recent commands
Stata Do-file Editor   To write or edit a program (needs to be opened)



Menus

Stata displays 9 drop-down menus across the top of the outer window, from left to right:

File
    Open                  open a Stata data file (use)
    Save/Save as          save the Stata data in memory to disk
    Do                    execute a do-file
    Filename              copy a filename to the command line
    Print                 print log or graph
    Exit                  quit Stata

Edit
    Copy/Paste            copy text among the Command, Results, and Log windows
    Copy Table            copy table from Results window to another file
    Table copy options    what to do with table lines in Copy Table

Prefs                     Various options for setting preferences. For example, you can save a particular layout of the different Stata windows or change the colors used in Stata windows.

Data, Graphics, Statistics    build and run Stata commands from menus

User                      menus for user-supplied Stata commands (download from Internet)

Window                    bring a Stata window to the front

Help                      Stata command syntax and keyword searches



Button bar

The buttons on the button bar are, from left to right (the equivalent command, where one exists, follows the colon):

Open a Stata data file: use

Save the Stata data in memory to disk: save

Print a log or graph

Open a log, or suspend/close an open log: log

Open a new viewer

Bring Results window to front

Bring Graph window to front

New Dofile Editor: doedit

Edit the data in memory: edit

Browse the data in memory: browse

Scroll another page when --more-- is displayed: Space Bar

Stop current command or do-file: Ctrl-Break



EXPLORING DATA FILES

Common Stata Syntax

This section covers commands that are used for preliminary exploration of data in a file. Stata commands follow the same syntax:

[by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]

Items inside the square brackets are either optional or not available for every command. This syntax applies to all Stata commands.
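To see how the pieces of the syntax fit together, here is a minimal sketch. It assumes the auto example dataset that ships with Stata (loaded with sysuse) as a stand-in for the training file; the variable names make, price, mpg and foreign belong to that dataset, not to ERHScons1999.

```stata
* Load the example dataset shipped with Stata (assumption: used here
* only as a stand-in for the training data)
sysuse auto, clear

* command + varlist
summarize price mpg

* command + varlist + if exp + in range
list make price if price > 10000 in 1/40

* by prefix + command + varlist + option
bysort foreign: summarize price, detail
```

Each line uses only the parts of the general syntax it needs; the parts in square brackets can simply be left out.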



Logical operators used in Stata

~ Not

== Equal

~= not equal

!= not equal

> greater than

>= greater than or equal

< less than

<= less than or equal



Examining dataset

clear

The clear command removes all data, variables, and labels from memory to get ready to use a new data file.

You can clear memory using the clear command or by using the clear option as part of the use command.

This command does not delete any data saved to the hard drive.



set memory

First you can check how much memory is allocated to hold your data using the memory command.

By default we have about 10 MB free for reading in a data file.

Whenever we try to read a data file bigger than the free memory, we get the error message:

no room to add more observations

r(901);



. memory
                                            bytes
--------------------------------------------------------------------
Details of set memory usage
    overhead (pointers)                     5,808        0.06%
    data                                  107,448        1.02%
                                     ----------------------------
    data + overhead                       113,256        1.08%
    free                               10,372,496       98.92%
                                     ----------------------------
    Total allocated                    10,485,752      100.00%
--------------------------------------------------------------------
Other memory usage
    set maxvar usage                    1,816,666
    set matsize usage                   1,315,200
    programs, saved results, etc.           3,338
                                     ---------------
    Total                               3,135,204
-------------------------------------------------------
Grand total                            13,620,956



In this case we have to allocate more memory, say 25 MB (if 25 MB is sufficient for the current file), with the set memory command before trying to use our file.

set memory 25m

Now that we have allocated enough memory, we will be able to read bigger files, provided that they fit within the specified memory.

If we want to allocate 25m (25 megabytes) every time we start Stata, we can type:

set memory 25m, permanently

Note that Stata 12 and later manage memory automatically, so set memory is only needed in older versions.



use

This command opens an existing Stata data file. The syntax is:

use filename [, clear]                                        opens new file

use [varlist] [if exp] [in range] using filename [, clear]    opens selected parts of file

If there is no extension, Stata assumes it is .dta.

If there is no path, Stata assumes it is in the current folder.

You can use a path name such as: use C:\...\ERHScons1999

If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999"

You can open selected variables of a file using a variable list.

You can open selected records of a file using if or in.



Here are some examples of the use command:

use ERHScons1999                           opens the file ERHScons1999.dta for analysis
use ERHScons1999 if q1a == 1               opens data from region 1
use ERHScons1999 in 5/25                   opens records 5 through 25 of file
use hhid hhsize cons using ERHScons1999    opens 3 variables from ERHScons1999 file
use C:\training\ERHScons1999               opens the file ERHScons1999.dta in the specified folder
use "C:\data files\ERHScons1999"           use quotation marks if there are spaces
use ERHScons1999, clear                    clears memory before opening the new file
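If you do not have the training file at hand, the same ideas can be tried on any .dta file. The sketch below is meant to be run as a do-file. It assumes the auto dataset shipped with Stata and first saves a temporary copy of it (the name practice is made up for this illustration), then reopens selected variables and records with the using form of the command.

```stata
* Make a temporary practice file from the auto dataset shipped with Stata
* (assumption: stand-in for ERHScons1999; "practice" is a hypothetical name)
sysuse auto, clear
tempfile practice
save `practice'

* Open three variables only, and only the foreign cars
use make price foreign if foreign == 1 using `practice', clear
describe

* Open all variables, but only the first ten records
use in 1/10 using `practice', clear
count
```

The clear option is needed each time because data are already in memory when the next file is opened.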



save

The save command will save the dataset as a .dta file under the name you choose. Editing the dataset changes data in the computer's memory; it does not change the data that is stored on the computer's disk.

save C:\...\consumption.dta, replace

The replace option allows you to save a changed file to the disk, replacing the original file. Stata is worried that you will accidentally overwrite your data file. You need to use the replace option to tell Stata that you know that the file exists and you want to replace it.



edit

This command opens the Data Editor window, which allows us to view all observations in memory. You can change the data using the Data Editor window, but editing data this way is not recommended.

It is better to correct errors in the data using a Do-file program that can be saved (we will see Do-file programs later).



browse

This window is exactly like the Stata editor window except that you can't change the data.

describe

This command provides a brief description of the data file. You can use des or d and Stata will understand. The output includes:

the number of variables
the number of observations (records)
the size of the file
the list of variables and their characteristics



Example 1: Using describe to show information about a data file

. des

Contains data from C:\training\ERHSCONS1999.dta
  obs:         1,452
 vars:            15                          24 Feb 2007 07:07
 size:       113,256 (98.9% of memory free)   (_dta has notes)
-----------------------------------------------------------------------------
               storage  display    value
variable name  type     format     label     variable label
-----------------------------------------------------------------------------
q1a            float    %9.0g      reg       Region
q1b            double   %15.0g     w         Wereda
q1c            double   %17.0g     pa        Peseant association
q1d            double   %12.0g               Household id
sexh           byte     %8.0g      sexhh     Sex of household head
ageh           float    %9.0g      p1s1q4    Age of household head
cons           float    %9.0g                consumption per month
food           float    %9.0g                food cons per month
hhsize         byte     %8.0g                household size
aeu            float    %9.0g                adult equivalent units in household
fpi            float    %9.0g                food price index
rconspc        float    %9.0g                real consumption per capita 1994 prices
rconsae        float    %9.0g                real consumption per adult 1994 prices
poor           double   %8.2f
hhid           double   %12.0f               selected household unique id
-----------------------------------------------------------------------------
Sorted by: hhid



list

This command lists values of variables in the data set. The syntax is:

list [varlist] [if exp] [in range]

Examples:

. list                       lists entire dataset
. list in 1/10               lists observations 1 through 10
. list hhsize q1a food       lists selected variables
. list hhsize sex in 1/20    lists observations 1-20 for selected variables
. list if q1a < 6            lists cases in regions 1 through 5



if

This qualifier is used to select certain records in carrying out a command. The syntax is:

command if exp

Examples:

. list hhid q1a food if food>12000     lists data if food is above 12000

. tab q1a if cons>10000 & cons [...] =1200     browse data if food consumption is above 12000

Note that if statements always use ==, not a single =. Also note that | indicates or while & indicates and.



in

We have also used in to select records based on the case number. The syntax is:

command in range

For example:

. list in 10              lists observation number 10
. summarize in 10/20      summarizes observations 10-20
. l in -10/-1             lists the last 10 observations



codebook

The codebook command is a great tool for getting a quick overview of the variables in the data file. It produces a kind of electronic codebook from the data file, displaying information about variables' names, labels and values.

. codebook sexh

sexh                                              Sex of household head
----------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sexhh

                 range:  [0,1]                      units:  1
         unique values:  2                      missing .:  0/1452

            tabulation:  Freq.   Numeric  Label
                           400         0  Female
                          1052         1  Male



inspect

It is another useful command for getting a quick overview of a data file. The inspect command displays information about the values of variables and is useful for checking data accuracy.

. inspect sexh

sexh:  Sex of household head              Number of Observations
------------------------------     --------------------------------------
                                       Total     Integers    Nonintegers
|  #                   Negative            -            -              -
|  #                   Zero              400          400              -
|  #                   Positive         1052         1052              -
|  #                                   -----        -----          -----
|  #  #                Total            1452         1452              -
|  #  #                Missing             -
+----------------------                -----
0                     1                 1452
   (2 unique values)

sexh is labeled and all values are documented in the label.



count

The count command shows the number of observations that satisfy an if condition. If no condition is specified, count displays the number of observations in the data.

. count

1452

. count if q1a==3

466



Descriptive Statistics

tabulate, tab1, tab2

These are three related commands that produce frequency tables for discrete variables.

They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable).




tabulate or tab    produces a frequency table for one or two variables
tab1               produces a one-way frequency table for each variable in the variable list
tab2               produces all possible two-variable tables from the list of variables




You can use several options with these commands:

all       gives all the tests of association for two-way tables
cell      gives the overall percentage for two-way tables
column    gives column percentages for two-way tables
row       gives row percentages for two-way tables
nofreq    suppresses printing the frequencies
chi2      provides the chi-squared test for two-way tables

There are many other options, including other statistical tests. For more information, type help tabulate.




Some examples of the tabulate commands are:

. tabulate q1a                         produces a table of frequencies by region
. tabulate q1a sexh                    produces a cross-tab of frequencies by region and sex of head
. tabulate q1a hhsize, row             produces a cross-tab by region and hhsize with row percentages
. tabulate sexh hhsize, cell nofreq    produces a cross-tab of overall percentages by sex and hhsize
. tab1 q1a q1b hhsize                  produces three tables, a frequency table for each variable
. tab2 q1a poor sexh                   produces three tables, a cross-tab of each pair of variables




summarize

The summarize command produces statistics on continuous variables like age, food, cons, and hhsize. The syntax looks like this:

summarize [varlist] [if exp] [in range] [, detail]

By default, it produces the following statistics:

Number of observations
Average (or mean)
Standard deviation
Minimum
Maximum

If you specify detail, Stata gives you additional statistics, such as

skewness, kurtosis,
the four smallest values
the four largest values
various percentiles.




Here are some examples:

. summarize                            gives statistics on all variables
. summarize hhsize food                gives statistics on selected variables
. summarize hhsize cons if q1a==3      gives statistics on two variables for one region




by

This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is:

by varlist: command

Note: by requires the data to be sorted by varlist. The bysort prefix is most commonly used because it sorts the data and runs the command in one step.

An example of the by prefix is:

bysort sex: sum rconsae      for each sex of household head, gives statistics on real consumption per adult equivalent
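The difference between by and bysort can be tried out with a short sketch. It assumes the auto dataset shipped with Stata as a stand-in for the training file, with foreign playing the role of the grouping variable.

```stata
* Example data shipped with Stata (assumption: stand-in for the training file)
sysuse auto, clear

* by needs the data sorted on the grouping variable first
sort foreign
by foreign: summarize price

* bysort sorts and repeats the command in a single step
bysort foreign: summarize price mpg
```

Both forms give one block of output per value of the grouping variable.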




help

The help command gives you information about any Stata command or topic.

help [command]

For example,

. help tabulate      gives a description of the tabulate command
. help summarize     gives a description of the summarize command



STORING COMMANDS AND OUTPUT

The following topics are covered:

Using the Do-file Editor

log using

log off

log on

log close

set logtype to move tables from Stata to Word and

Excel




Using the Do-file Editor

The Do-file Editor allows you to store a program (a set of commands).

It makes it easier to check and fix errors.
It allows you to run the commands later.
It lets you show others how you got your result.
It allows you to collaborate with others on the analysis.




In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file to store the commands.

To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope on the Tool Bar.




Keyboard commands are quicker to use than the buttons. The most useful ones are:

Control-O Open file

Control-S Save file

Control-C Copy

Control-X Cut

Control-V Paste

Control-Z Undo

Control-F Find

Control-H Find and Replace




To run the commands in a Do-file, you can click on the Do button (the second-to-last one) or click on Tools/Do.

If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button.

Note: If you would like to add a note to a do-file but do not want Stata to execute it, enclose the note in /* */.




Saving the Output

The Stata Results window does not keep all the output you generate.

It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results.

Thus, we need to use log to save the output.




log using

This command creates a file with a copy of all the commands and output from Stata. The syntax is:

log using filename [, append replace [ text | smcl ] ]

append     adds the output to an existing file
replace    replaces an existing file with the output
text       tells Stata to create the log file in text (ASCII) format
smcl       tells Stata to create the log file in SMCL format




Here are some examples:

log using temp22                       saves output to a file called temp22
log using temp22, replace              saves output to an existing file, temp22, replacing its content
log using temp22, append               saves output to an existing file, temp22, adding to its contents
log using "d:\my data\myfile.txt"      saves output in the specified file in the specified folder




log off

This command temporarily turns off the logging of output.

log on

This command is used to restart the logging.

log close

This command is used to turn off the logging and save the file.
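Put together, a do-file that keeps a log might look like the sketch below. It assumes the auto dataset shipped with Stata, and the log name practice_log is made up for the illustration. The // comments work inside do-files; the /* */ form described earlier works as well.

```stata
* A small do-file sketch: comments, a text log, and pausing the log
capture log close                 // close any log left open by an earlier run
log using practice_log, replace text

sysuse auto, clear                /* example data shipped with Stata */
summarize price mpg

log off                           // output from here on is not written to the log
describe
log on                            // resume logging

tabulate foreign
log close                         // stop logging and save the log file
```

Starting with capture log close is a common habit: it avoids an error when the do-file is run a second time while a log is still open.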




set logtype text

This command tells Stata to always save the log files in text (ASCII) format.

set logtype smcl

This command tells Stata to always save log files in SMCL format.



CREATING NEW VARIABLES

So far we have seen how to explore the data using existing variables.

Now we will discuss how to create new variables.

When new variables are created, they are in memory and they will appear in the Data Browser, but they will not be saved on the hard disk unless you use the save command.



generate

This command is used to create a new variable. It is similar to compute in SPSS.

The syntax is:

generate newvar = exp [if exp]

where exp is an expression like price*quant or 1000*kg



generate cannot be used to change the definition of an existing variable.

You can use gen or g as an abbreviation for generate.

If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.

If you use if, the new variable will have missing values when the if statement is false.



For example,

generate age2 = age*age

create age squared variable

gen yield = outputkg/area if area>0

create new yield variable if area is positive

gen price = value/quant if quant>0

create new price variable if quant is positive

gen highprice = (price>1000)

creates a dummy variable equal to 1 for high prices
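The same three patterns (a plain expression, an expression with if, and a true/false expression) can be tried on any dataset. The sketch below assumes the auto dataset shipped with Stata; the new variable names weight2, mpg_foreign and expensive are made up for the illustration.

```stata
* Example data shipped with Stata (assumption: stand-in for the training file)
sysuse auto, clear

* Plain expression: weight squared
generate weight2 = weight*weight

* With if: the new variable is missing wherever the condition is false
gen mpg_foreign = mpg if foreign == 1
count if missing(mpg_foreign)

* True/false expression: 1 when price is above 10000, 0 otherwise.
* Missing values count as larger than any number, so the extra if
* keeps the dummy missing when price itself is missing.
gen expensive = (price > 10000) if !missing(price)
tabulate expensive
```

The if !missing() guard makes no difference here if price has no missing values, but it is a safe habit for survey data.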



replace

This command is used to change the definition of an existing variable.

The syntax is the same:

replace oldvar = exp [if exp] [in range]



For example,

replace price = avgprice if price > 100000

replaces high values with an average price

replace income =. if income



tabulate generate

This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable.

The syntax is:

tabulate oldvariable, generate(newvariable)



tab q1a, gen(region)

This creates 6 new variables, numbered in the order of the sorted values of q1a (not by the values themselves):

region1 = 1 if q1a == 1 and 0 otherwise

region2 = 1 if q1a == 3 and 0 otherwise

...

region6 = 1 if q1a == 8 and 0 otherwise
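A quick way to see how the dummies are numbered is the sketch below. It assumes the auto dataset shipped with Stata and its repair-record variable rep78; the stub name hirep is made up for the illustration.

```stata
* Example data shipped with Stata (assumption: stand-in for the training file)
sysuse auto, clear

* Keep only the categories 3, 4 and 5 so the numbering is easy to follow
tabulate rep78 if inrange(rep78, 3, 5), generate(hirep)

* hirep1 marks rep78==3, hirep2 marks rep78==4, hirep3 marks rep78==5:
* the suffix counts categories, it does not repeat the category value
tabulate rep78 hirep1, missing
summarize hirep1 hirep2 hirep3
```

Observations left out of the tabulation get missing values in the new dummies.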



egen

This is an extended version of generate (extended generate) that creates a new variable by aggregating the existing data.

The syntax is:

egen newvar = fcn(arguments) [if exp] [in range] [, by(varlist)]



count()     number of non-missing values
diff()      compares variables, 1 if different, 0 otherwise
fill()      fill with a pattern
group()     creates a group id from a list of variables
iqr()       interquartile range
ma()        moving average
max()       maximum value
mean()      mean
median()    median
min()       minimum value
pctile()    percentile
rank()      rank
rmean()     mean across variables
sd()        standard deviation
std()       standardize variables
sum()       sums



egen avg = mean(yield)

creates variable of average yield over entire sample

egen avg2 = median(income), by(sex)

creates variable of median income for each sex

egen regprod = sum(prod), by(region)

creates variable of total production for each region



Exercise

We want to know which households have expenditure (cons) above the village average, i.e. create a dummy equal to 1 for households that consume more than the village/peasant association average and 0 otherwise.



egen avecon = mean(cons), by(q1c)

gen highavecon = (cons > avecon)

list hhid q1c cons avecon highavecon in 650/675
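The same group-average pattern can be practised without the ERHS file. The sketch below assumes the auto dataset shipped with Stata, with foreign standing in for the peasant association; the new variable names are made up for the illustration. It uses total(), the current name of the egen function this training lists as sum().

```stata
* Example data shipped with Stata (assumption: stand-in for ERHScons1999)
sysuse auto, clear

* Group mean attached to every observation in the group
egen avgprice = mean(price), by(foreign)

* Dummy: 1 if the car is priced above its group average, 0 otherwise
gen aboveavg = (price > avgprice) if !missing(price)

* Group total (total() is the current name for the sum() function)
egen totweight = total(weight), by(foreign)

list make foreign price avgprice aboveavg in 1/10
```

Because egen repeats the group statistic on every row, the comparison in gen can be done observation by observation.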



Arithmetic

+ addition

- subtraction

* multiplication

/ division

^ power

Logical

~ not

| or

& and

Relational

> greater than

< less than

>= greater than or equal

<= less than or equal

== equal

!= not equal



Here are some examples to illustrate the use of these operators. Suppose you want to create a dummy variable indicating households in the Amhara region. One way is to write:

generate AmD = 0

replace AmD = 1 if q1a==3

Or you can get exactly the same result with just one command:

generate AmD = (q1a==3)



For example, suppose a household must have a female head and be in Dodota wereda to be selected.

gen DDfemale = 0

replace DDfemale = 1 if q1b==9 & sexh==0

or an easier way to do this would be:

gen DDfemale = (q1b==9 & sexh==0)



abs(x)          computes the absolute value of x
exp(x)          calculates e to the x power
ln(x)           computes the natural logarithm of x
log(x)          is a synonym for ln(x), the natural logarithm
log10(x)        computes the log base 10 of x
sqrt(x)         computes the square root of x
invnorm(p)      provides the inverse cumulative normal; invnorm(norm(z)) = z
normden(z)      provides the standard normal density
normden(z,s)    provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not missing; otherwise, the result is missing
norm(z)         provides the cumulative standard normal
group(x)        creates a categorical variable that divides the data into x as nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data
int(x)          gives the integer obtained by truncating x
round(x,y)      gives x rounded into units of y
