Stata Training 1

  • 8/3/2019 Stata Training 1

    Introduction to Stata

    Fitsum Zewdu

    J. Research Fellow

    EEPRI

    The Stata Interface

    Windows

    The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Windows pull-down menu or the buttons on the tool bar.

    These are the Stata windows:

    Stata Results: to see recent commands and output
    Stata Command: to enter a command
    Stata Browser: to view the data file (needs to be opened)
    Stata Editor: to edit the data file (needs to be opened)
    Stata Viewer: to get help on how to use Stata
    Variables: to see a list of variables
    Review: to see recent commands
    Stata Do-file Editor: to write or edit a program (needs to be opened)

    Menus

    Stata displays 9 drop-down menus across the top of the outer window, from left to right:

    File
      Open: open a Stata data file (use)
      Save/Save as: save the Stata data in memory to disk
      Do: execute a do-file
      Filename: copy a filename to the command line
      Print: print log or graph
      Exit: quit Stata

    Edit
      Copy/Paste: copy text among the Command, Results, and Log windows
      Copy Table: copy table from Results window to another file
      Table copy options: what to do with table lines in Copy Table

    Prefs: various options for setting preferences. For example, you can save a particular layout of the different Stata windows or change the colors used in Stata windows.

    Data, Graphics, Statistics: build and run Stata commands from menus

    User: menus for user-supplied Stata commands (download from Internet)

    Window: bring a Stata window to the front

    Help: Stata command syntax and keyword searches

    Button bar

    The buttons on the button bar are from left to right (equivalent command is in bold):

    Open a Stata data file: use

    Save the Stata data in memory to disk: save

    Print a log or graph

    Open a log, or suspend/close an open log: log

    Open a new viewer

    Bring Results window to front

    Bring Graph window to front

    New Dofile Editor: doedit

    Edit the data in memory: edit

    Browse the data in memory: browse

    Scroll another page when --more-- is displayed: Space Bar

    Stop current command or do-file: Ctrl-Break

    EXPLORING DATA FILES

    Common Stata Syntax

    This section covers commands that are used for preliminary exploration of data in a file. Stata commands follow the same syntax:

    [by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]

    Items inside square brackets are either optional or not available for every command. This syntax applies to all Stata commands.
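As a rough illustration of how the pieces of this syntax fit together, the sketch below builds a tiny made-up dataset and runs one command that combines a by prefix, a variable list, an if condition and an option. Only the variable names (q1a, hhsize, food) come from this training; the values are invented for the example.

```stata
* Hypothetical mini-dataset; the values are invented for illustration
clear
input q1a hhsize food
1 4  900
1 6 1500
3 5 1200
3 3  700
3 8 2000
end

* by varlist:  command    varlist      if exp         , options
bysort q1a:    summarize  food hhsize  if hhsize > 3  , detail
```

Dropping any bracketed piece (the prefix, the if condition, or the option) still leaves a valid command, which is what the square brackets in the syntax line indicate.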

    Logical operators used in Stata

    ~     not
    ==    equal
    ~=    not equal
    !=    not equal
    >     greater than
    >=    greater than or equal
    <     less than
    <=    less than or equal

    Examining dataset

    clear

    The clear command deletes all data, variables, and labels from memory to get ready to use a new data file.

    You can clear memory using the clear command or by using the clear option as part of the use command.

    This command does not delete any data saved to the hard drive.

    set memory

    First, you can check how much memory is allocated to hold your data using the memory command.

    By default, about 10 MB is free for reading in a data file. (This applies to Stata 11 and earlier; from Stata 12 onward, memory is managed automatically.)

    Whenever we try to read a data file bigger than the free memory, we get the error message:

    no room to add more observations
    r(901);

    . memory
                                                 bytes
    --------------------------------------------------------------------
    Details of set memory usage
        overhead (pointers)                      5,808      0.06%
        data                                   107,448      1.02%
                                          ----------------------------
        data + overhead                        113,256      1.08%
        free                                10,372,496     98.92%
                                          ----------------------------
        Total allocated                     10,485,752    100.00%
    --------------------------------------------------------------------
    Other memory usage
        set maxvar usage                     1,816,666
        set matsize usage                    1,315,200
        programs, saved results, etc.            3,338
                                          ---------------
        Total                                3,135,204
    -------------------------------------------------------
    Grand total                             13,620,956

    In this case we have to allocate more memory, say 25 MB (if 25 MB is sufficient for the current file), with the set memory command before trying to use our file:

    set memory 25m

    Now that we have allocated enough memory, we will be able to read bigger files, provided that they fit within the specified memory.

    If we want to allocate 25m (25 megabytes) every time we start Stata, we can type:

    set memory 25m, permanently

    use

    This command opens an existing Stata data file. The syntax is:

    use filename [, clear]                                        opens a new file
    use [varlist] [if exp] [in range] using filename [, clear]    opens selected parts of a file

    If there is no extension, Stata assumes it is .dta.

    If there is no path, Stata assumes it is in the current folder.

    You can use a path name such as: use C:\...\ERHScons1999

    If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999"

    You can open selected variables of a file using a variable list.

    You can open selected records of a file using if or in.

    Here are some examples of the use command:

    use ERHScons1999                            opens the file ERHScons1999.dta for analysis
    use if q1a == 1 using ERHScons1999          opens data from region 1
    use in 5/25 using ERHScons1999              opens records 5 through 25 of the file
    use hhid hhsize cons using ERHScons1999     opens 3 variables from the ERHScons1999 file
    use C:\training\ERHScons1999                opens the file ERHScons1999.dta in the specified folder
    use "C:\data files\ERHScons1999"            use quotation marks if there are spaces
    use ERHScons1999, clear                     clears memory before opening the new file
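The second form of use can be tried without the training file. The sketch below is only an illustration: it saves a small made-up dataset to a temporary file (the local macro name demo and all values are assumptions) and then reopens just some variables and records from it.

```stata
* Build a small stand-in file so the example runs anywhere (hypothetical values)
clear
input hhid q1a hhsize cons
1 1 4 250
2 1 6 410
3 3 5 300
4 4 2 120
end
tempfile demo
save "`demo'"

* Open three variables, and only the records from region 1.
* Selecting parts of a file requires the "using" keyword.
use hhid q1a cons if q1a == 1 using "`demo'", clear
describe
```

After the last command only two observations and three variables are in memory; hhsize was never loaded.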

    save

    The save command will save the dataset as a .dta file under the name you choose. Editing the dataset changes data in the computer's memory; it does not change the data that is stored on the computer's disk.

    save C:\...\consumption.dta, replace

    The replace option allows you to save a changed file to the disk, replacing the original file. Stata guards against accidentally overwriting your data file, so you need to use the replace option to tell Stata that you know the file exists and you want to replace it.
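The behaviour of replace can be seen in a short sketch. It assumes a temporary file (the macro name f is hypothetical) so that nothing on disk is touched, and uses capture to record the refusal instead of stopping the do-file.

```stata
* Hypothetical three-row dataset
clear
set obs 3
generate x = _n

tempfile f
save "`f'"                  // first save: the file does not exist yet

replace x = x * 10          // changes the copy in memory only

capture save "`f'"          // no replace option: Stata refuses to overwrite
scalar rc_plain = _rc       // nonzero return code records the refusal

save "`f'", replace         // explicit permission to overwrite the file
```

Until the final line runs, the file on disk still holds the original values 1, 2, 3.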

    edit

    This command opens the Data Editor window, which allows us to view all observations in memory.

    You can change the data using the Data Editor window, but it is not recommended to edit data this way.

    It is better to correct errors in the data using a Do-file program that can be saved (we will see Do-file programs later).

    browse

    This command opens a window that is exactly like the Stata Editor window, except that you can't change the data.

    describe

    This command provides a brief description of the data file. You can use des or d and Stata will understand. The output includes:

    the number of variables
    the number of observations (records)
    the size of the file
    the list of variables and their characteristics

    Example 1: Using describe to show information about a data file.

    . des

    Contains data from C:\training\ERHSCONS1999.dta
      obs:         1,452
     vars:            15                          24 Feb 2007 07:07
     size:       113,256 (98.9% of memory free)   (_dta has notes)
    -----------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -----------------------------------------------------------------------------
    q1a             float  %9.0g       reg        Region
    q1b             double %15.0g      w          Wereda
    q1c             double %17.0g      pa         Peseant association
    q1d             double %12.0g                 Household id
    sexh            byte   %8.0g       sexhh      Sex of household head
    ageh            float  %9.0g       p1s1q4     Age of household head
    cons            float  %9.0g                  consumption per month
    food            float  %9.0g                  food cons per month
    hhsize          byte   %8.0g                  household size
    aeu             float  %9.0g                  adult equivalent units in household
    fpi             float  %9.0g                  food price index
    rconspc         float  %9.0g                  real consumption per capita 1994 prices
    rconsae         float  %9.0g                  real consumption per adult 1994 prices
    poor            double %8.2f
    hhid            double %12.0f                 selected household unique id
    -----------------------------------------------------------------------------
    Sorted by: hhid

    list

    This command lists values of variables in the data set. The syntax is:

    list [varlist] [if exp] [in range]

    Examples:

    . list                        lists entire dataset
    . list in 1/10                lists observations 1 through 10
    . list hhsize q1a food        lists selected variables
    . list hhsize sex in 1/20     lists observations 1-20 for selected variables
    . list if q1a < 6             lists cases where region is 1 through 5

    if

    This qualifier is used to select certain records in carrying out a command:

    command if exp

    Examples:

    . list hhid q1a food if food>12000            lists data if food is above 12000
    . tab q1a if cons>10000 & cons[...]=1200      browse data if food consumption is above 12000

    Note that if statements always use ==, not a single =. Also note that | indicates or while & indicates and.
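The following sketch shows the ==, & and | operators in if conditions. The four households and their values are invented for the example; only the variable names follow the training data.

```stata
* Hypothetical data: four households
clear
input hhid q1a food cons
1 1 15000 18000
2 3  8000  9000
3 3 13000 25000
4 4   500  1200
end

* & means "and": both conditions must be true
count if food > 12000 & q1a == 3
scalar n_and = r(N)

* | means "or": at least one condition must be true
count if food > 12000 | q1a == 4
scalar n_or = r(N)

* Equality tests need a double equals sign
list hhid food if q1a == 3
```

With these values the "and" condition matches one household (hhid 3) and the "or" condition matches three (hhid 1, 3 and 4).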

    in

    We have also used in to select records based on the case number. The syntax is:

    command in range

    For example:

    . list in 10               lists observation number 10
    . summarize in 10/20       summarizes observations 10-20
    . l in -10/-1              lists the last 10 observations
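A quick way to check what a range selects is to number the observations and count them. This is a minimal sketch on a made-up dataset of 30 observations; the variable name id is an assumption.

```stata
* Hypothetical data: 30 observations numbered 1 to 30
clear
set obs 30
generate id = _n

count in 10/20              // observations 10 through 20
scalar n_mid = r(N)

summarize id in -10/-1      // negative numbers count back from the end
scalar first_of_last10 = r(min)

list id in 1/3              // the first three observations
```

Here 10/20 covers 11 observations, and the last ten observations start at number 21.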

    codebook

    The codebook command is a great tool for getting a quick overview of the variables in the data file. It produces a kind of electronic codebook from the data file, displaying information about variables' names, labels and values.

    . codebook

    sexh                                          Sex of household head
    ----------------------------------------------------------------------------
                      type:  numeric (byte)
                     label:  sexhh

                     range:  [0,1]                        units:  1
             unique values:  2                        missing .:  0/1452

                tabulation:  Freq.   Numeric  Label
                               400         0  Female
                              1052         1  Male

    inspect

    This is another useful command for getting a quick overview of a data file. The inspect command displays information about the values of variables and is useful for checking data accuracy.

    . inspect sexh

    sexh:  Sex of household head          Number of Observations
    ----------------------------     ---------------------------------
                                        Total    Integers   Nonintegers
    |      #    Negative                    -           -           -
    |      #    Zero                      400         400           -
    |      #    Positive                 1052        1052           -
    |      #                            -----       -----       -----
    |  #   #    Total                    1452        1452           -
    |  #   #    Missing                     -
    +----------------------             -----
    0          1                         1452
      (2 unique values)

    sexh is labeled and all values are documented in the label.

    count

    The count command shows the number of observations that satisfy an if condition. If no condition is specified, count displays the number of observations in the data.

    . count
    1452

    . count if q1a==3
    466

    Descriptive Statistics

    tabulate, tab1, tab2

    These are three related commands that produce frequency tables for discrete variables.

    They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable).

    tabulate or tab    produces a frequency table for one or two variables
    tab1               produces a one-way frequency table for each variable in the variable list
    tab2               produces all possible two-variable tables from the list of variables

    You can use several options with these commands:

    all       gives all the tests of association for two-way tables
    cell      gives the overall percentage for two-way tables
    column    gives column percentages for two-way tables
    row       gives row percentages for two-way tables
    nofreq    suppresses printing the frequencies
    chi2      provides the chi-squared test for two-way tables

    There are many other options, including other statistical tests. For more information, type help tabulate

    Some examples of the tabulate commands are:

    . tabulate q1a                         produces a table of frequencies by region
    . tabulate q1a sexh                    produces a cross-tab of frequencies by region and sex of head
    . tabulate q1a hhsize, row             produces a cross-tab by region and hhsize with row percentages
    . tabulate sexh hhsize, cell nofreq    produces a cross-tab of overall percentages by sex and hhsize
    . tab1 q1a q1b hhsize                  produces three tables, a frequency table for each variable
    . tab2 q1a poor sexh                   produces three tables, a cross-tab of each pair of variables
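To try these commands without the training file, the sketch below uses five invented households. The values are hypothetical; the scalars only record the table dimensions that tabulate leaves behind in r(r) and r(c).

```stata
* Hypothetical data: region, sex of head (0 = female, 1 = male), household size
clear
input q1a sexh hhsize
1 0 4
1 1 6
3 1 5
3 1 3
4 0 7
end

* One-way frequency table
tabulate q1a

* Two-way table showing row percentages only
tabulate q1a sexh, row nofreq
scalar nrows = r(r)         // number of rows in the table
scalar ncols = r(c)         // number of columns in the table

* A separate one-way table for each listed variable
tab1 q1a sexh
```

Three distinct regions and two values of sexh give a table with three rows and two columns.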

    summarize

    The summarize command produces statistics on continuous variables like age, food, cons, and hhsize. The syntax looks like this:

    summarize [varlist] [if exp] [in range] [, detail]

    By default, it produces the following statistics:

    Number of observations
    Average (or mean)
    Standard deviation
    Minimum
    Maximum

    If you specify detail, Stata gives you additional statistics, such as:

    skewness and kurtosis
    the four smallest values
    the four largest values
    various percentiles

    Here are some examples:

    . summarize                            gives statistics on all variables
    . summarize hhsize food                gives statistics on selected variables
    . summarize hhsize cons if q1a==3      gives statistics on two variables for one region

    by

    This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is:

    by varlist: command

    Note: by requires the data to be sorted by varlist; the bysort prefix is most commonly used because it does the sorting in the same step.

    An example of the by prefix:

    bysort sexh: sum rconsae      for each sex of household head, gives statistics on real consumption per adult equivalent
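The difference between by and bysort can be seen on a small unsorted dataset. This is only a sketch with invented values; capture is used so that the expected "not sorted" refusal does not stop the do-file.

```stata
* Hypothetical data, deliberately not sorted by sexh
clear
input sexh rconsae
1  80
0 120
1 100
0  60
end

* Plain by fails here because the data are not sorted by sexh
capture by sexh: summarize rconsae
scalar rc_unsorted = _rc

* bysort sorts first, then repeats the command for each group
bysort sexh: summarize rconsae
scalar mean_last_group = r(mean)    // r() holds results for the last group (sexh == 1)
```

The two-step equivalent of the bysort line is sort sexh followed by by sexh: summarize rconsae.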

    help

    The help command gives you information about any Stata command or topic:

    help [command]

    For example,

    . help tabulate       gives a description of the tabulate command
    . help summarize      gives a description of the summarize command

    STORING COMMANDS AND OUTPUT

    The following topics are covered:

    Using the Do-file Editor

    log using

    log off

    log on

    log close

    set logtype to move tables from Stata to Word and Excel

    Using the Do-file Editor

    The Do-file Editor allows you to store a program (a set of commands).

    It makes it easier to check and fix errors.
    It allows you to run the commands later.
    It lets you show others how you got your result.
    It allows you to collaborate with others on the analysis.

    In general, any time you are running more

    than 10 commands to get a result, it is easier

    and safer to use a Do-file to store the

    commands.

    To open the Do-file Editor, you can click on

    Windows/Do-file Editor or click on the

    envelope on the Tool Bar.

    Keyboard commands are quicker to use than the buttons. The most useful ones are:

    Control-O Open file

    Control-S Save file

    Control-C Copy

    Control-X Cut

    Control-V Paste

    Control-Z Undo

    Control-F Find

    Control-H Find and Replace

    To run the commands in a Do-file, you can click on the Do button (the second-to-last one) or click on Tools/Do.

    If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button.

    Note: If you would like to add a note to a do-file but do not want Stata to execute it, enclose the note in /* */.
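A short do-file fragment showing such notes might look like the sketch below. Besides the /* */ form named above, it also uses the * and // comment styles, which are not mentioned in this training; the scalar name answer is hypothetical.

```stata
/* Block comment: Stata skips everything between the markers,
   even across several lines */
scalar answer = 2 + 2

* A line that starts with an asterisk is also a comment
display answer      // text after a double slash is ignored in do-files
```

Only the scalar and display lines are executed; the rest is there for the reader.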

    Saving the Output

    The Stata Results window does not keep all the output you generate.

    It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results.

    Thus, we need to use log to save the output.

    log using

    This command creates a file with a copy of all the

    commands and output from Stata. The syntax is:

    log using filename [, append replace [ text | smcl ] ]

    append     adds the output to an existing file
    replace    replaces an existing file with the output
    text       tells Stata to create the log file in text (ASCII) format
    smcl       tells Stata to create the log file in SMCL format

    Here are some examples:

    log using temp22                       saves output to a file called temp22
    log using temp22, replace              saves output to an existing file, temp22, replacing its contents
    log using temp22, append               saves output to an existing file, temp22, adding to its contents
    log using "d:\my data\myfile.txt"      saves output in the specified file in the specified folder

    log off

    This command temporarily turns off the logging of output.

    log on

    This command is used to restart the logging.

    log close

    This command is used to turn off the logging and save the file.
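Put together, a logging session could look like the sketch below. The file name demo_log.txt and the log name demo are placeholders chosen for this example; naming the log keeps the sketch from clashing with any log that is already open.

```stata
* demo_log.txt and the log name "demo" are placeholder names
log using demo_log.txt, text replace name(demo)

display "this line is recorded"

log off demo                        // pause logging
display "this line is not recorded"
log on demo                         // resume logging

display "recording again"

log close demo                      // stop logging and close the file
```

The resulting text file is written to the current working folder and can be opened in any editor or pasted into Word.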

    set logtype text

    This command tells Stata to always save the log

    files in text (ASCII) format

    set logtype smcl

    This command tells Stata to always save log files in

    SMCL format.

    CREATING NEW VARIABLES

    So far, we have seen how to explore the data using existing variables.

    Now we will discuss how to create new variables.

    When new variables are created, they are in memory and they will appear in the Data Browser, but they will not be saved on the hard disk unless you use the save command.

    generate

    This command is used to create a new variable. It is similar to compute in SPSS.

    The syntax is:

    generate newvar = exp [if exp]

    where exp is an expression like price*quant or 1000*kg

    generate cannot be used to change the definition of an existing variable.

    You can use gen or g as an abbreviation for generate.

    If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.

    If you use if, the new variable will have missing values when the if statement is false.

    For example,

    generate age2 = age*age

    create age squared variable

    gen yield = outputkg/area if area>0

    create new yield variable if area is positive

    gen price = value/quant if quant>0

    create new price variable if quant is positive

    gen highprice = (price>1000)

    creates a dummy variable equal to 1 for high prices
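The price and highprice examples can be traced on three invented records. This is a sketch, not part of the training data; the cutoff of 1500 and the variable highprice2 are assumptions made for the illustration. It also shows one point worth checking in practice: Stata treats a missing value as larger than any number, so a comparison such as price > 1500 is true when price is missing.

```stata
* Hypothetical data: value and quantity for three purchases
clear
input value quant
5000 5
3000 0
4000 2
end

gen price = value/quant if quant > 0       // missing where quant is 0

* Caution: a missing price counts as "greater than 1500" here
gen highprice = (price > 1500)

* Restricting to non-missing prices leaves the dummy missing instead
gen highprice2 = (price > 1500) if !missing(price)

list
```

For the second record, price is missing, highprice is 1 and highprice2 is missing.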

    replace

    This command is used to change the definition of

    an existing variable.

    The syntax is the same:

    replace oldvar = exp [if exp] [in range]

    For example,

    replace price = avgprice if price > 100000

    replaces high values with an average price

    replace income = . if income [...]

    tabulate generate

    This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable.

    The syntax is:

    tabulate oldvariable, generate(newvariable)

    tab q1a, gen(region)

    This creates 6 new variables, numbered in order of the values of q1a:

    region1 = 1 if q1a == 1 and 0 otherwise
    region2 = 1 if q1a == 3 and 0 otherwise
    ...
    region6 = 1 if q1a == 8 and 0 otherwise
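The numbering of the new dummies follows the order of the categories, not the category codes themselves. The sketch below illustrates this with a made-up dataset that has only four region codes, so it creates four dummies rather than six.

```stata
* Hypothetical data with four distinct region codes: 1, 3, 4 and 8
clear
input hhid q1a
1 1
2 3
3 4
4 8
5 3
end

* Creates region1 ... region4, one dummy per distinct value of q1a
tabulate q1a, generate(region)

describe region*
list hhid q1a region*
```

Here region4 marks q1a == 8, and no variable called region8 is created.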

    egen

    This is an extended version of generate (extended generate), used to create a new variable by aggregating the existing data.

    The syntax is:

    egen newvar = fcn(arguments) [if exp] [in range] [, by(var)]

    count() number of non-missing values

    diff() compares variables, 1 if different, 0 otherwise

    fill() fill with a pattern

    group() creates a group id from a list of variables

    iqr() interquartile range

    ma() moving average

    max() maximum value

    mean() mean

    median() median

    min() minimum value

    pctile() percentile

    rank() rank

    rmean() mean across variables

    sd() standard deviation

    std() standardize variables

    sum() sums

    egen avg = mean(yield)

    creates variable of average yield over

    entire sample

    egen avg2 = median(income), by(sex)

    creates variable of median income for each

    sex

    egen regprod = sum(prod), by(region)

    creates variable of total production for

    each region

    Exercise

    We want to know which households have expenditure (cons) above the village average, i.e. create a dummy (1 for those who consume above the village/peasant association average and 0 otherwise).

    egen avecon = mean(cons), by(q1c)

    gen highavecon = (cons > avecon)

    list hhid q1c cons avecon highavecon in 650/675
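The same two steps can be checked by hand on a tiny made-up dataset with two peasant associations. The values are invented. The extra variable totcons is an assumption added for illustration; it uses total(), the name under which current Stata releases document the egen sum() function.

```stata
* Hypothetical data: two peasant associations (q1c) and household consumption
clear
input q1c cons
1 100
1 300
2 200
2 400
2 600
end

egen avecon = mean(cons), by(q1c)       // group mean repeated on every row
gen highavecon = (cons > avecon)        // 1 if above the group mean

egen totcons = total(cons), by(q1c)     // group total, for comparison

list
```

The group means are 200 and 400, so only the households with cons of 300 and 600 get highavecon equal to 1.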

    Arithmetic

    +     addition
    -     subtraction
    *     multiplication
    /     division
    ^     power

    Logical

    ~     not
    |     or
    &     and

    Relational

    >     greater than
    <     less than
    >=    greater than or equal
    <=    less than or equal

    Here are some examples to illustrate the use of these operators. Suppose you want to create a dummy variable indicating households in the Amhara region. One way is to write:

    generate AmD = 0

    replace AmD = 1 if q1a==3

    Or you can get exactly the same result with just one command:

    generate AmD = (q1a==3)

    For example, suppose a household must have a female head and be in Dodota wereda to be selected.

    gen DDfemale = 0

    replace DDfemale = 1 if q1b==9 & sexh==0

    or an easier way to do this would be:

    gen DDfemale = (q1b==9 & sexh==0)

    abs(x)          computes the absolute value of x.
    exp(x)          calculates e to the x power.
    ln(x)           computes the natural logarithm of x.
    log(x)          is a synonym for ln(x), the natural logarithm.
    log10(x)        computes the log base 10 of x.
    sqrt(x)         computes the square root of x.
    invnorm(p)      provides the inverse cumulative normal; invnorm(norm(z)) = z.
    normden(z)      provides the standard normal density.
    normden(z,s)    provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not missing; otherwise, the result is missing.
    norm(z)         provides the cumulative standard normal.
    group(x)        creates a categorical variable that divides the data into x as nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
    int(x)          gives the integer obtained by truncating x.
    round(x,y)      gives x rounded into units of y.
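A few of these functions can be tried directly with display. The sketch below is illustrative only. For the normal distribution it uses normal() and invnormal(), the names under which recent Stata releases document norm() and invnorm(); if you are on an older version, the names in the list above apply.

```stata
display abs(-3)                     // absolute value: 3
display exp(0)                      // e to the power 0: 1
display sqrt(16)                    // square root: 4
display int(7.9)                    // truncates rather than rounds: 7
display round(7.26, 0.5)            // rounds to the nearest 0.5: 7.5
display ln(1)                       // natural logarithm of 1: 0

* normal() is the cumulative standard normal; invnormal() is its inverse
display normal(0)                   // .5
display invnormal(normal(1.5))      // approximately 1.5
```

The same functions can be used inside generate and replace expressions, for example gen lncons = ln(cons).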