Administration Procedures

Adding a DNS Slave to DMS

Please refer to the Debian Install Documentation, section Setting up a Slave DNS Server.

Break Fix Scenarios

Log and Configuration Files

The following are detailed elsewhere in the documentation:

/var/log/dms/dmsdmd.log*    dmsdmd logs
/var/log/local7.log         DMS named logs
/var/log/syslog             Basically everything
/etc/dms/dms.conf           dmsdmd, WSGI and zone_tool configuration file
/etc/dms                    Various passwords, templates and things

See Named.conf and Zone Templating for more details.

Checking DMS Status

  1. Check that named, postgres, and dmsdmd are running on the master.

  2. Using zone_tool show_dms_status on master server:

       zone_tool > show_dms_status
    
       show_master_status:
    
                   MASTER_SERVER:         dms-akl
    
                   NAMED master configuration state:
    
                   hold_sg:               HOLD_SG_NONE
                   hold_sg_name:          None
                   hold_start:            Wed Nov 7 16:52:36 2012
                   hold_stop:             Wed Nov 7 17:02:36 2012
                   replica_sg_name:       vygr-replica
                   state:                 HOLD
    
       show_replica_sg:
               sg_name:                      vygr-replica
               config_dir:                   /etc/net24/server-config-templates
               master_address:               2406:1e00:1001:1::2
               master_alt_address:           2406:3e00:1001:1::2
               replica_sg:                   True
               zone_count:                   37
    
                   Replica SG named status:
                   dms-chc                             2406:3e00:1001:1::2
    
                            OK
    
        ls_server:
        dms-akl                       Wed Nov 7 16:52:46 2012                OK
                2406:1e00:1001:1::2                      None
                ping: 5 packets transmitted, 5 received, 0.00% packet loss
        dms-chc                       Wed Nov 7 16:52:46 2012                OK
                2406:3e00:1001:1::2                      210.5.48.242
                ping: 5 packets transmitted, 5 received, 0.00% packet loss
        dms-s1-akl                    Wed Nov 7 16:31:04 2012                RETRY
                2406:1e00:1001:2::2                      103.4.136.226
                ping: 5 packets transmitted, 5 received, 0.00% packet loss
                retry_msg:
                   Server 'dms-s1-akl': SOA query - timeout waiting for
                   response, retrying
        dms-s1-chc                    Wed Nov 7 16:52:46 2012                OK
                2406:3e00:1001:2::2                      210.5.48.226
                ping: 5 packets transmitted, 5 received, 0.00% packet loss
    
        list_pending_events:
        ServerSMConfigure         dms-s1-akl                   Wed Nov   7 16:57:22 2012
        ServerSMCheckServer       dms-chc                      Wed Nov   7 16:53:55 2012
        ServerSMCheckServer       dms-akl                      Wed Nov   7 16:55:46 2012
        ServerSMCheckServer       dms-s1-chc                   Wed Nov   7 16:57:06 2012
        MasterSMHoldTimeout                                    Wed Nov   7 17:02:36 2012
    
        zone_tool >
    
    * Check the master server name, and that this machine is actually the master.
    * Check the master state. HOLD means named was reconfigured in the last 10
      minutes.
    * All servers listed under ls_server should be in the OK or CONFIG states.
      A server staying in RETRY or BROKEN may not be contactable. Servers in
      RETRY or BROKEN should also have a retry_msg: field giving the
      associated log message.
    * list_pending_events shows events that have yet to be processed.
    * Any events scheduled in the past may indicate that dmsdmd is having
      serious problems.
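
The server-state check can be scripted against captured show_dms_status output. The following is a minimal Python sketch and not part of DMS: the function name unhealthy_servers and the regular expression are assumptions based only on the sample output shown in this section, so the pattern may need adjusting for other versions of zone_tool.

```python
import re

# Matches an ls_server header line such as:
#   dms-s1-akl    Wed Nov 7 16:31:04 2012    RETRY
# The layout is assumed from the sample output in this document.
SERVER_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+"
    r"(?P<when>[A-Z][a-z]{2} [A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"(?P<state>[A-Z]+)\s*$"
)

# States the checklist treats as normal.
HEALTHY_STATES = {"OK", "CONFIG"}


def unhealthy_servers(status_text):
    """Return (name, state) pairs for servers not in OK or CONFIG."""
    problems = []
    for line in status_text.splitlines():
        match = SERVER_LINE.match(line)
        if match and match.group("state") not in HEALTHY_STATES:
            problems.append((match.group("name"), match.group("state")))
    return problems


sample = """\
        dms-akl                       Wed Nov 7 16:52:46 2012                OK
                2406:1e00:1001:1::2                      None
        dms-s1-akl                    Wed Nov 7 16:31:04 2012                RETRY
                2406:1e00:1001:2::2                      103.4.136.226
"""
print(unhealthy_servers(sample))
```

Address, ping and retry_msg lines do not match the pattern and are skipped, so only the per-server header lines are considered.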
    

Failing Over as Master Server has Burned (or Subject to EQC Claim)

On the Replica:

dms-chc: -root- [~]
# dms_promote_replica
+ perl -pe s/^#(\s*local7.* :ompgsql:\S+,dms,rsyslog,.*$)/\1/ -i
/etc/rsyslog.d/pgsql.conf
+ set +x
[ ok ] Stopping enhanced syslogd: rsyslogd.
[ ok ] Starting enhanced syslogd: rsyslogd.
+ perl -pe s/^NET24DMD_ENABLE=.*$/NET24DMD_ENABLE=true/ -i
/etc/default/net24dmd
+ perl -pe s/^OPTIONS=.*$/OPTIONS="-u bind"/ -i /etc/default/bind9
+ set +x
[....] Stopping domain name service...: bind9waiting for pid 8744 to die
. ok
[ ok ] Starting domain name service...: bind9.
[ ok ] Starting net24dmd: net24dmd.
+ zone_tool write_rndc_conf
+ zone_tool reconfig_all
+ perl -pe s/^#+(.*zone_tool vacuum_all)$/\1/ -i /etc/cron.d/dms-core
+ do_dms_wsgi
+ return 0
+ perl -pe s/^(\s*exit\s+0.*$)/#\1/ -i /etc/default/apache2
+ set +x
[ ok ] Starting web server: apache2.

dms-chc: -root- [~]
#

Wait until the servers have started, then use zone_tool show_dms_status to check that everything becomes OK. This may take up to 15 minutes. The list_pending_events section gives the scheduled times for servers to be configured.

dms-chc: -root- [~]
# zone_tool show_dms_status

show_master_status:

  MASTER_SERVER:      dms-chc

  NAMED master configuration state:

  hold_sg:            HOLD_SG_NONE
  hold_sg_name:       None
  hold_start:         Fri Nov 9 08:30:49 2012
  hold_stop:          Fri Nov 9 08:40:49 2012
  replica_sg_name:    vygr-replica
  state:              HOLD

  show_replica_sg:
          sg_name:              vygr-replica
          config_dir:           /etc/net24/server-config-templates
          master_address:       2406:1e00:1001:1::2
          master_alt_address:   2406:3e00:1001:1::2
          replica_sg:           True
          zone_count:           37

          Replica SG named status:
          dms-akl                         2406:1e00:1001:1::2

                  RETRY

  ls_server:
  dms-akl                       Fri Nov 9 08:23:08 2012                  RETRY
          2406:1e00:1001:1::2                      None
          ping: 5 packets transmitted, 5 received, 0.00% packet loss
          retry_msg:
             Server 'dms-akl': SOA query - timeout waiting for response,
             retrying
  dms-chc                       Fri Nov 9 08:30:58 2012                  OK
          2406:3e00:1001:1::2                      210.5.48.242
          ping: 5 packets transmitted, 5 received, 0.00% packet loss
  dms-s1-akl                    Fri Nov 9 08:30:58 2012                  OK
          2406:1e00:1001:2::2                      103.4.136.226
          ping: 5 packets transmitted, 5 received, 0.00% packet loss
  dms-s1-chc                    Fri Nov 9 08:30:58 2012                  OK
          2406:3e00:1001:2::2                      210.5.48.226
          ping: 5 packets transmitted, 5 received, 0.00% packet loss

  list_pending_events:
  ServerSMCheckServer        dms-chc                        Fri Nov   9 08:39:53 2012
  MasterSMHoldTimeout                                       Fri Nov   9 08:40:49 2012
  ServerSMCheckServer        dms-s1-chc                     Fri Nov   9 08:40:08 2012
  ServerSMCheckServer        dms-s1-akl                     Fri Nov   9 08:36:01 2012
  ServerSMConfigure          dms-akl                        Fri Nov   9 08:50:17 2012


dms-chc: -root- [~]
#

A new replica will then need to be installed as per DMS Master Server Install.

Stuck Zone not Propagating

zone_tool > show_zonesm wham-blam.org
        name:            wham-blam.org.
        alt_sg_name:     None
        auto_dnssec:     False
        ctime:           Thu Aug 23 10:51:14 2012
        deleted_start:   None
        edit_lock:       True
        edit_lock_token: None
        inc_updates:     False
        lock_state:      EDIT_UNLOCK
        locked_at:       None
        locked_by:       None
        mtime:           Thu Aug 23 10:51:14 2012
        nsec3:           True
        reference:       nutty-nutty@ANATHOTH-NET
        sg_name:         anathoth-internal
        soa_serial:      2012091400
        state:           UNCONFIG
        use_apex_ns:     True
        zi_candidate_id: 102880
        zi_id:           102880
        zone_id:         101448
        zone_type:       DynDNSZoneSM

           zi_id:                  102880
           change_by:              grantma@shalom-ext.internal.anathoth.net/Admin
           ctime:                  Fri Sep 14 10:55:59 2012
           mtime:                  Fri Sep 14 11:12:10 2012
           ptime:                  Fri Sep 14 11:12:10 2012
           soa_expire:             7d
           soa_minimum:            600
           soa_mname:              ns1.internal.anathoth.net.
           soa_refresh:            24h
           soa_retry:              900
           soa_rname:              matthewgrant5.gmail.com.
           soa_serial:             2012091400
           soa_ttl:                None
           zone_id:                101448
           zone_ttl:               24h

A stuck zone may look like the above, sitting in a state such as UNCONFIG. This can be caused by:

  • Failed events (manually failed or otherwise, events queue deleted in the DB, or permissions problems as follows)

  • Permissions problems on the master server on the /var/lib/bind/dynamic directory, which should be:

    # ls -ld /var/lib/bind/dynamic/
    drwxrwsr-x 2 bind dmsdmd 487424 Nov  9 08:47 /var/lib/bind/dynamic/
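
The directory permission check can also be done programmatically. The following is a minimal Python sketch using only the standard library. The function names check_dir, owner_name and group_name are hypothetical helpers, not DMS commands, and the expected values passed in the example (mode 2775, owner bind, group dmsdmd) are the ones quoted in this document.

```python
import grp
import os
import pwd
import stat


def owner_name(uid):
    """Resolve a uid to a user name, falling back to the number."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def group_name(gid):
    """Resolve a gid to a group name, falling back to the number."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def check_dir(path, want_mode, want_owner, want_group):
    """Return a list of mismatches for path; an empty list means it is fine."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path}: does not exist"]
    problems = []
    if not stat.S_ISDIR(st.st_mode):
        problems.append(f"{path}: not a directory")
    mode = stat.S_IMODE(st.st_mode)  # permission bits, including setgid
    if mode != want_mode:
        problems.append(f"{path}: mode {mode:04o}, expected {want_mode:04o}")
    owner = owner_name(st.st_uid)
    if owner != want_owner:
        problems.append(f"{path}: owner {owner}, expected {want_owner}")
    group = group_name(st.st_gid)
    if group != want_group:
        problems.append(f"{path}: group {group}, expected {want_group}")
    return problems


if __name__ == "__main__":
    # drwxrwsr-x bind dmsdmd corresponds to mode 2775, owner bind, group dmsdmd.
    for problem in check_dir("/var/lib/bind/dynamic", 0o2775, "bind", "dmsdmd"):
        print(problem)
```

The same helper can be pointed at the master-config directory with mode 0o2755 and the owner and group given for it elsewhere in this document.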
    

Do a reset_zonesm wham-blam.org, answering y at the y/N prompt and noting that DNSSEC RRSIGs will be destroyed:

zone_tool > reset_zonesm wham-blam.org.
***   WARNING - doing this destroys DNSSEC RRSIG data.
***   Do really you wish to do this?
        --y/[N]> y
zone_tool >

And check again:

zone_tool > show_zonesm wham-blam.org
        name:            wham-blam.org.
        alt_sg_name:     None
        auto_dnssec:     False
        ctime:           Thu Aug 23 10:51:14 2012
        deleted_start:   None
        edit_lock:       True
        edit_lock_token: None
        inc_updates:     False
        lock_state:      EDIT_UNLOCK
        locked_at:       None
        locked_by:       None
        mtime:           Thu Aug 23 10:51:14 2012
        nsec3:           True
        reference:       nutty-nutty@ANATHOTH-NET
        sg_name:         anathoth-internal
        soa_serial:      2012091400
        state:           RESET
        use_apex_ns:     True
        zi_candidate_id: 102880
        zi_id:           102880
        zone_id:         101448
        zone_type:       DynDNSZoneSM

          zi_id:                 102880
          change_by:             grantma@shalom-ext.internal.anathoth.net/Admin
          ctime:                 Fri Sep 14 10:55:59 2012
          mtime:                 Fri Sep 14 11:12:10 2012
          ptime:                 Fri Sep 14 11:12:10 2012
          soa_expire:            7d
          soa_minimum:           600
          soa_mname:             ns1.internal.anathoth.net.
          soa_refresh:           24h
          soa_retry:             900
          soa_rname:             matthewgrant5.gmail.com.
          soa_serial:            2012091400
          soa_ttl:               None
          zone_id:               101448
          zone_ttl:              24h

Then use show_zonesm to check that the zone state goes to PUBLISHED within 15 minutes. ls_pending_events may also be useful, as it shows the events to do with the zone being published.

zone_tool > show_zonesm wham-blam.org
        name:             wham-blam.org.
        alt_sg_name:      None
        auto_dnssec:      False
        ctime:            Thu Aug 23 10:51:14 2012
        deleted_start:    None
        edit_lock:        True
        edit_lock_token: None
        inc_updates:      False
        lock_state:       EDIT_UNLOCK
        locked_at:        None
        locked_by:        None
        mtime:            Thu Aug 23 10:51:14 2012
        nsec3:            True
        reference:        nutty-nutty@ANATHOTH-NET
        sg_name:          anathoth-internal
        soa_serial:       2012091400
        state:            RESET
        use_apex_ns:      True
        zi_candidate_id: 102880
        zi_id:            102880
        zone_id:          101448
        zone_type:        DynDNSZoneSM

        zi_id:           102880
        change_by:       grantma@shalom-ext.internal.anathoth.net/Admin
        ctime:           Fri Sep 14 10:55:59 2012
        mtime:           Fri Sep 14 11:12:10 2012
        ptime:           Fri Sep 14 11:12:10 2012
        soa_expire:      7d
        soa_minimum:     600
        soa_mname:       ns1.internal.anathoth.net.
        soa_refresh:     24h
        soa_retry:       900
        soa_rname:       matthewgrant5.gmail.com.
        soa_serial:      2012091400
        soa_ttl:         None
        zone_id:         101448
        zone_ttl:        24h
zone_tool > ls_pending_events
ServerSMCheckServer       shalom                       Fri Nov   9 08:50:35 2012
ServerSMCheckServer       shalom-ext                   Fri Nov   9 08:50:40 2012
ServerSMCheckServer       shalom-dr                    Fri Nov   9 08:50:46 2012
ServerSMCheckServer       dns-slave1                   Fri Nov   9 08:50:53 2012
ServerSMConfigure         en-gedi-auth                 Fri Nov   9 08:55:31 2012
ZoneSMConfig              wham-blam.org.               Fri Nov   9 08:47:07 2012
MasterSMHoldTimeout                                    Fri Nov   9 08:56:52 2012
ServerSMCheckServer       dns-slave0                   Fri Nov   9 08:54:29 2012
zone_tool > show_zonesm wham-blam.org
        name:            wham-blam.org.
        alt_sg_name:     None
        auto_dnssec:     False
        ctime:           Thu Aug 23 10:51:14 2012
        deleted_start:   None
        edit_lock:       True
        edit_lock_token: None
        inc_updates:     False
        lock_state:      EDIT_UNLOCK
        locked_at:       None
        locked_by:       None
        mtime:           Thu Aug 23 10:51:14 2012
        nsec3:           True
        reference:       nutty-nutty@ANATHOTH-NET
        sg_name:         anathoth-internal
        soa_serial:      2012091400
        state:           UNCONFIG
        use_apex_ns:     True
        zi_candidate_id: 102880
        zi_id:           102880
        zone_id:         101448
        zone_type:       DynDNSZoneSM

        zi_id:           102880
        change_by:       grantma@shalom-ext.internal.anathoth.net/Admin
        ctime:           Fri Sep 14 10:55:59 2012
        mtime:           Fri Sep 14 11:12:10 2012
        ptime:           Fri Sep 14 11:12:10 2012
        soa_expire:      7d
        soa_minimum:     600
        soa_mname:       ns1.internal.anathoth.net.
        soa_refresh:     24h
        soa_retry:       900
        soa_rname:       matthewgrant5.gmail.com.
        soa_serial:      2012091400
        soa_ttl:         None
        zone_id:         101448
        zone_ttl:        24h
zone_tool > ls_pending_events
ServerSMCheckServer       shalom                       Fri Nov   9 08:50:35 2012
ServerSMCheckServer       shalom-ext                   Fri Nov   9 08:50:40 2012
ServerSMCheckServer       shalom-dr                    Fri Nov   9 08:50:46 2012
ServerSMCheckServer       dns-slave1                   Fri Nov   9 08:50:53 2012
ServerSMConfigure         en-gedi-auth                 Fri Nov   9 08:55:31 2012
MasterSMHoldTimeout                                    Fri Nov   9 08:56:52 2012
ServerSMCheckServer       dns-slave0                   Fri Nov   9 08:54:29 2012
ZoneSMReconfigUpdate      wham-blam.org.               Fri Nov   9 08:57:10 2012
zone_tool > ls_pending_events
ServerSMCheckServer       shalom-ext                   Fri Nov   9 09:00:25 2012
ServerSMCheckServer       shalom-dr                    Fri Nov   9 09:00:44 2012
ServerSMCheckServer       dns-slave0                   Fri Nov   9 09:01:25 2012
ServerSMCheckServer       dns-slave1                   Fri Nov   9 09:02:11 2012
ServerSMConfigure         en-gedi-auth                 Fri Nov   9 09:06:15 2012
MasterSMHoldTimeout                                    Fri Nov   9 09:06:57 2012
ServerSMCheckServer       shalom                       Fri Nov   9 09:05:11 2012
zone_tool > show_zonesm wham-blam.org
        name:            wham-blam.org.
        alt_sg_name:     None
        auto_dnssec:     False
        ctime:           Thu Aug 23 10:51:14 2012
        deleted_start:   None
        edit_lock:       True
        edit_lock_token: None
        inc_updates:     False
        lock_state:      EDIT_UNLOCK
        locked_at:       None
        locked_by:       None
        mtime:           Thu Aug 23 10:51:14 2012
        nsec3:           True
        reference:       nutty-nutty@ANATHOTH-NET
        sg_name:         anathoth-internal
        soa_serial:      2012091400
        state:           PUBLISHED
        use_apex_ns:     True
        zi_candidate_id: 102880
        zi_id:           102880
        zone_id:         101448
        zone_type:       DynDNSZoneSM

        zi_id:           102880
        change_by:       grantma@shalom-ext.internal.anathoth.net/Admin
        ctime:           Fri Sep 14 10:55:59 2012
        mtime:           Fri Nov 9 08:57:13 2012
        ptime:           Fri Nov 9 08:57:13 2012
        soa_expire:      7d
        soa_minimum:     600
        soa_mname:       ns1.internal.anathoth.net.
        soa_refresh:     24h
        soa_retry:       900
        soa_rname:       matthewgrant5.gmail.com.
        soa_serial:      2012091400
        soa_ttl:         None
        zone_id:         101448
        zone_ttl:        24h
zone_tool >
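
Repeatedly checking show_zonesm for the PUBLISHED state can be automated. The following is a minimal Python sketch under stated assumptions: zone_fields and wait_for_state are hypothetical helper names, the parsing is based only on the sample output above, and the fetch argument is a caller-supplied function that returns the text of show_zonesm for the zone (how it obtains that text is left open here).

```python
import time


def zone_fields(show_zonesm_text):
    """Parse 'key: value' lines; the first occurrence of a key wins."""
    fields = {}
    for line in show_zonesm_text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Skip prompts and anything that is not a simple 'key: value' line.
        if sep and key and " " not in key:
            fields.setdefault(key, value.strip())
    return fields


def wait_for_state(fetch, wanted="PUBLISHED", timeout=900, interval=30):
    """Poll fetch() until the zone state equals wanted or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if zone_fields(fetch()).get("state") == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


sample = """\
        name:            wham-blam.org.
        soa_serial:      2012091400
        state:           RESET
"""
print(zone_fields(sample)["state"])
```

The default timeout of 900 seconds mirrors the 15 minutes quoted above. Because the first occurrence of a key wins, zone-level fields such as ctime take precedence over the same names in the zi section.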

MasterSM Stuck, New Zones not Being Created

Can be caused by:

  • Failed MasterSMHoldTimeout events (manually failed or otherwise, Events queue deleted in DB etc)

  • Permissions problems on the master server on the /etc/bind/master-config directory, which should be 2755 dmsdmd:bind (the daemon user appears as net24dmd in this example):

    shalom-ext: -grantma- [~]
    $ ls -ld /etc/bind/master-config
    drwxr-sr-x 2 net24dmd bind 4096 Nov  9 08:56 /etc/bind/master-config
    

This shows up in zone_tool show_dms_status:

zone_tool > show_dms_status

show_master_status:

          MASTER_SERVER:            dms-akl

          NAMED master configuration state:

          hold_sg:                  HOLD_SG_NONE
          hold_sg_name:             None
          hold_start:               Wed Nov 7 16:52:36 2012
          hold_stop:                Wed Nov 7 17:02:36 2012
          replica_sg_name:          vygr-replica
          state:                    HOLD

show_replica_sg:
        sg_name:                       vygr-replica
        config_dir:                    /etc/net24/server-config-templates
        master_address:                2406:1e00:1001:1::2
        master_alt_address:            2406:3e00:1001:1::2
        replica_sg:                    True
        zone_count:                    37

          Replica SG named status:
          dms-chc                                2406:3e00:1001:1::2

                     OK

ls_server:
  dms-akl                      Wed Nov 7 16:52:46 2012                 OK
        2406:1e00:1001:1::2                     None
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
  dms-chc                       Wed Nov 7 16:52:46 2012                OK
          2406:3e00:1001:1::2                      210.5.48.242
          ping: 5 packets transmitted, 5 received, 0.00% packet loss
  dms-s1-akl                    Wed Nov 7 16:31:04 2012                RETRY
          2406:1e00:1001:2::2                      103.4.136.226
          ping: 5 packets transmitted, 5 received, 0.00% packet loss
          retry_msg:
             Server 'dms-s1-akl': SOA query - timeout waiting for
             response, retrying
  dms-s1-chc                    Wed Nov 7 16:52:46 2012                OK
          2406:3e00:1001:2::2                      210.5.48.226
          ping: 5 packets transmitted, 5 received, 0.00% packet loss

  list_pending_events:
  ServerSMConfigure         dms-s1-akl                   Wed Nov   7 16:57:22 2012
  ServerSMCheckServer       dms-chc                      Wed Nov   7 16:53:55 2012
  ServerSMCheckServer       dms-akl                      Wed Nov   7 16:55:46 2012
  ServerSMCheckServer       dms-s1-chc                   Wed Nov   7 16:57:06 2012


  zone_tool > exit

  dms-akl: -root- [~]
  # date
  Wed Nov  7 16:54:42 NZDT 2012

Key things to look for:

  • master status section shows hold_start and hold_stop being in the past
  • there is no MasterSMHoldTimeout event

Note

The MasterSM state machine posts the MasterSMHoldTimeout event in advance when entering the HOLD state. If that event is never created, disappears, or fails because of an outage or other unforeseen event, the MasterSM ends up stuck as above.

The fix is to do zone_tool reset_master. This will reset the MasterSM state machine.
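
The two symptoms listed under "Key things to look for" can be tested mechanically on captured show_dms_status output. The following is a minimal Python sketch, not part of DMS. The names status_field and master_looks_stuck are hypothetical, the timestamp layout is assumed from the sample output, and only hold_stop is compared with the current time since hold_start always precedes it.

```python
import re
from datetime import datetime

# Timestamp layout used in the status output, e.g. "Wed Nov 7 17:02:36 2012".
# %a and %b are locale dependent; this assumes an English or C locale.
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def status_field(status_text, name):
    """Return the value of the first 'name: value' line, or None."""
    match = re.search(
        r"^[ \t]*" + re.escape(name) + r":[ \t]+(.+?)[ \t]*$",
        status_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


def master_looks_stuck(status_text, now):
    """True if state is HOLD, hold_stop has passed, and no timeout event is queued."""
    if status_field(status_text, "state") != "HOLD":
        return False
    hold_stop = status_field(status_text, "hold_stop")
    if hold_stop in (None, "None"):
        return False
    expired = datetime.strptime(hold_stop, TIME_FORMAT) < now
    has_timeout_event = "MasterSMHoldTimeout" in status_text
    return expired and not has_timeout_event


sample = """\
  hold_start:         Wed Nov 7 16:52:36 2012
  hold_stop:          Wed Nov 7 17:02:36 2012
  state:              HOLD
list_pending_events:
  ServerSMCheckServer       dms-chc        Wed Nov 7 16:53:55 2012
"""
print(master_looks_stuck(sample, datetime(2012, 11, 7, 17, 30)))
```

Pass datetime.now() as the second argument when running against live output on the master, so that the comparison uses the same clock as the date command.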

Stuck ServerSM

Just like the Master state machine getting stuck because of a missing MasterSMHoldTimeout event, Server SMs can become stuck in the CONFIG, RETRY or BROKEN states due to missing events. In that case there is no ServerSMConfigure event for the server in the ls_pending_events output:

zone_tool > show_dms_status
show_master_status:
        MASTER_SERVER:      shalom-ext
        NAMED master configuration state:
        hold_sg:            HOLD_SG_NONE
        hold_sg_name:       None
        hold_start:         None
        hold_stop:          None
        replica_sg_name:    anathoth-replica
        state:              READY
show_replica_sg:
        sg_name:              anathoth-replica
        config_dir:           /etc/bind/anathoth-master
        master_address:       2001:470:f012:2::2
        master_alt_address: 2001:470:f012:2::3
        replica_sg:           True
        zone_count:           14
        Replica SG named status:
        shalom-dr                      2001:470:f012:2::3
                 OK
ls_server:
dns-slave0                    Fri Nov 9 09:56:48 2012                  OK
        2001:470:c:110e::2                        111.65.238.10
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
dns-slave1                    Fri Nov 9 09:56:38 2012                  OK
        2001:470:66:23::2                         111.65.238.11
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
en-gedi-auth                  Thu Nov 8 18:01:07 2012                  RETRY
        fd14:828:ba69:6:5054:ff:fe39:54f9         172.31.12.2
        ping: 5 packets transmitted, 0 received, 100.00% packet loss
        retry_msg:
           Server 'en-gedi-auth': failed to rsync include files,
           Command '['rsync', '--quiet', '-av', '--password-file',
           '/etc/net24/rsync-dnsconf-password', '/var/lib/net24/dms-sg
           /anathoth-internal/',
           'dnsconf@[fd14:828:ba69:6:5054:ff:fe39:54f9]::dnsconf/']'
           returned non-zero exit status 10, rsync: failed to connect
           to fd14:828:ba69:6:5054:ff:fe39:54f9
           (fd14:828:ba69:6:5054:ff:fe39:54f9): Connection timed out
           (110), rsync error: error in socket IO (code 10) at
           clientserver.c(122) [sender=3.0.9]
shalom                        Fri Nov 9 09:56:19 2012                  OK
        fd14:828:ba69:1:21c:f0ff:fefa:f3c0        192.168.110.1
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
shalom-dr                     Fri Nov 9 09:56:56 2012                  OK
        2001:470:f012:2::3                        172.31.10.4
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
shalom-ext                    Fri Nov 9 09:58:21 2012                  OK
        2001:470:f012:2::2                        172.31.10.2
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
list_pending_events:
ServerSMCheckServer        shalom                         Fri Nov 9 10:01:43   2012
ServerSMCheckServer        dns-slave1                     Fri Nov 9 10:01:55   2012
ServerSMCheckServer        dns-slave0                     Fri Nov 9 10:03:17   2012
ServerSMCheckServer        shalom-dr                      Fri Nov 9 10:05:25   2012
ServerSMCheckServer        shalom-ext                     Fri Nov 9 10:04:49   2012
zone_tool >

Note

Above, the ls_server section of show_dms_status displays the reason for going to RETRY or BROKEN in the displayed retry_msg field.

The fix is to reset_server the server, then use ls_pending_events to check that a ServerSMConfigure event has been created:

zone_tool > reset_server en-gedi-auth
zone_tool > ls_pending_events
ServerSMCheckServer       shalom                                  Fri   Nov   9   12:11:17   2012
ServerSMCheckServer       shalom-ext                              Fri   Nov   9   12:11:47   2012
ServerSMCheckServer       en-gedi-auth                            Fri   Nov   9   12:14:57   2012
ServerSMCheckServer       dns-slave0                              Fri   Nov   9   12:18:02   2012
ServerSMCheckServer       shalom-dr                               Fri   Nov   9   12:15:09   2012
ServerSMCheckServer       dns-slave1                              Fri   Nov   9   12:19:08   2012
ServerSMConfigure         en-gedi-auth                            Fri   Nov   9   12:10:39   2012
zone_tool >
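
Checking for a particular pending event, or for events scheduled in the past, can be scripted. The following is a minimal Python sketch that parses ls_pending_events output where each event sits on a single line, as in the listing above. The names parse_events, has_event and overdue are hypothetical helpers, and the line layout is assumed from the sample output.

```python
import re
from datetime import datetime

# %a and %b are locale dependent; this assumes an English or C locale.
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

# Event name, optional target (server or zone), then the scheduled time.
EVENT_LINE = re.compile(
    r"^\s*(?P<event>\S+)\s+"
    r"(?:(?P<target>\S+)\s+)?"
    r"(?P<when>[A-Z][a-z]{2}\s+[A-Z][a-z]{2}\s+\d{1,2}\s+"
    r"\d{2}:\d{2}:\d{2}\s+\d{4})\s*$"
)


def parse_events(text):
    """Return (event, target, scheduled_time) tuples; target may be None."""
    events = []
    for line in text.splitlines():
        match = EVENT_LINE.match(line)
        if match:
            # Collapse runs of spaces before parsing the timestamp.
            stamp = " ".join(match.group("when").split())
            when = datetime.strptime(stamp, TIME_FORMAT)
            events.append((match.group("event"), match.group("target"), when))
    return events


def has_event(events, event, target):
    """True if an event of the given type is queued for the target."""
    return any(e == event and t == target for e, t, _ in events)


def overdue(events, now):
    """Return events whose scheduled time is earlier than now."""
    return [item for item in events if item[2] < now]


sample = """\
ServerSMCheckServer       shalom           Fri   Nov   9   12:11:17   2012
ServerSMCheckServer       en-gedi-auth     Fri   Nov   9   12:14:57   2012
ServerSMConfigure         en-gedi-auth     Fri   Nov   9   12:10:39   2012
"""
events = parse_events(sample)
print(has_event(events, "ServerSMConfigure", "en-gedi-auth"))
```

An event that stays in the overdue list for long may indicate that dmsdmd is not processing its queue.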

Wait until the scheduled time posted for ServerSMConfigure, and then do a zone_tool show_dms_status to make sure everything is working:

zone_tool > show_dms_status
show_master_status:
        MASTER_SERVER:      shalom-ext
        NAMED master configuration state:
        hold_sg:            HOLD_SG_NONE
        hold_sg_name:       None
        hold_start:         None
        hold_stop:          None
        replica_sg_name:    anathoth-replica
        state:              READY
show_replica_sg:
        sg_name:              anathoth-replica
        config_dir:           /etc/bind/anathoth-master
        master_address:       2001:470:f012:2::2
        master_alt_address: 2001:470:f012:2::3
        replica_sg:           True
        zone_count:           14
        Replica SG named status:
        shalom-dr                      2001:470:f012:2::3
                 OK
ls_server:
dns-slave0                    Fri Nov 9 12:08:29 2012                  OK
        2001:470:c:110e::2                        111.65.238.10
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
dns-slave1                    Fri Nov 9 12:10:19 2012                  OK
        2001:470:66:23::2                         111.65.238.11
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
en-gedi-auth                  Fri Nov 9 12:10:43 2012                  OK
        fd14:828:ba69:6:5054:ff:fe39:54f9         172.31.12.2
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
shalom                        Fri Nov 9 12:11:19 2012                  OK
        fd14:828:ba69:1:21c:f0ff:fefa:f3c0        192.168.110.1
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
shalom-dr                     Fri Nov 9 12:09:44 2012                  OK
        2001:470:f012:2::3                        172.31.10.4
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
shalom-ext                    Fri Nov 9 12:11:47 2012                  OK
        2001:470:f012:2::2                        172.31.10.2
        ping: 5 packets transmitted, 5 received, 0.00% packet loss
list_pending_events:
ServerSMCheckServer        en-gedi-auth                   Fri Nov 9 12:14:57   2012
ServerSMCheckServer        dns-slave0                     Fri Nov 9 12:18:02   2012
ServerSMCheckServer        shalom-dr                      Fri Nov 9 12:15:09   2012
ServerSMCheckServer        dns-slave1                     Fri Nov 9 12:19:08   2012
ServerSMCheckServer        shalom                         Fri Nov 9 12:17:44   2012
ServerSMCheckServer        shalom-ext                     Fri Nov 9 12:17:31   2012
zone_tool >

Rebuilding named data from database

Use this procedure when the named dynamic data in /var/lib/bind/dynamic is corrupt or missing.

  1. Stop named and dmsdmd:

    root@dms3-master:~# service bind9 stop
    [....] Stopping domain name service...: bind9waiting for pid 15462 to die
    . ok
    root@dms3-master:~# service net24dmd stop
    [ ok ] Stopping net24dmd: net24dmd.
    
  2. Check /var/lib/dms/master-config and /var/lib/bind/dynamic permissions.

    /var/lib/dms/master-config, should be 2755 dmsdmd:bind:

          root@dms3-master:~# ls -ld /var/lib/dms/master-config/
          drwxr-sr-x 2 dmsdmd bind 4096 Nov 9 12:39 /var/lib/dms/master-config/
          root@dms3-master:~#
    
    /var/lib/bind/dynamic, should be 2775 bind:dmsdmd:

          root@dms3-master:~# ls -ld /var/lib/bind/dynamic
          drwxrwsr-x 2 bind dmsdmd 1683456 Nov 9 12:39 /var/lib/bind/dynamic
          root@dms3-master:~#
    
  3. Clear any files from /var/lib/bind/dynamic if needed:

    root@dms3-master:~# rm -rf /var/lib/bind/dynamic/*
    root@dms3-master:~#
    
  4. Run the restore process, which recreates the contents of /etc/bind/master-config/ and of /var/lib/bind/dynamic. This may take some time: 40000 zones take 20 to 30 minutes.

    root@dms3-master:~# zone_tool restore_named_db
    ***   WARNING - doing this destroys DNSSEC RRSIG data. It is a last
          resort in DR recovery.
    ***   Do really you wish to do this?
     --y/[N]> y
    
  5. Start named and dmsdmd:

    root@dms3-master:~# service dmsdmd start
    [ ok ] Starting dmsdmd: dmsdmd.
    root@dms3-master:~# service bind9 start
    [ ok ] Starting domain name service...: bind9.
    root@dms3-master:~#
    

Failed Master, Replica /etc not up to date

The master and DR replica each have their etckeeper git archive mirrored to the alternate server every 4 hours. See etckeeper and /etc on Replica and Master Servers.

Recovering DB from Backup

/etc/cron.d/dms-core does a daily full pg_dumpall to /var/backups/postgresql-9.1-dms.sql.gz on both the replica and the master. The dumps are rotated for 7 days.

To recover:

# cd /var/backups
# gunzip -c postgresql-9.1-dms.sql.gz | psql -U pgsql

There will be lots of SQL output. The dump also contains DB user passwords and ACL/permissions information, along with DB details for the whole PostgreSQL ‘dms’ cluster.
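
Before piping a dump into psql, it can be worth confirming that the compressed file is intact. The following is a minimal Python sketch using the standard gzip module. The function name gzip_is_readable is hypothetical, and the check only verifies the gzip stream, not the SQL inside it.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1024 * 1024):
    """Decompress the whole file, discarding the data, to verify it is intact."""
    try:
        with gzip.open(path, "rb") as dump:
            # Reading to the end forces the CRC and length checks.
            while dump.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

A truncated or corrupt dump makes the function return False, in which case an older rotated copy in /var/backups is the next candidate to try.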

Regenerating ds/ DS material directory from Private Keys

Use the dns-recreateds command to recreate a domain's DNSSEC DS material. The /var/lib/bind/keys directory is rsynced to the DR replica by the dmsdmd daemon on the master server. Use a ‘*’ argument to regenerate all DS material.

shalom-ext: -root- [/var/lib/bind/keys]
# dns-recreateds anathoth.net
+ dnssec-dsfromkey -2 /var/lib/bind/keys/Kanathoth.net.+007+57318.key
+ set +x
shalom-ext: -root- [/var/lib/bind/keys]
#

IPSEC not going

These examples are between DNS slave server dns-slave1 and master shalom-ext, using racoon, via racoon-tool in Debian Wheezy.

Note

The ICMPv6 setup is specific to this Debian Wheezy racoon setup. However, the test techniques are also applicable to use with Strongswan and other IPSEC software.

Diagnosis

Use ping6 from the master to the server, and vice versa, to check the unencrypted network level (transport mode encryption does not encrypt ICMPv6). Use the zone_tool ls_server -v command to get the DMS configured IPv6 addresses of both servers.

shalom-ext: -grantma- [~/dms]
$ zone_tool ls_server -v dns-slave1
dns-slave1                    Mon Nov 12 13:57:20 2012                           OK
        2001:470:66:23::2                        111.65.238.11

shalom-ext: -grantma- [~/dms]
$ zone_tool ls_server -v shalom-ext
shalom-ext                    Mon Nov 12 13:59:29 2012                           OK
        2001:470:f012:2::2                       172.31.10.2
shalom-ext: -grantma- [~/dms]
$ ping6 2001:470:66:23::2
PING 2001:470:66:23::2(2001:470:66:23::2) 56 data bytes
64 bytes from 2001:470:66:23::2: icmp_seq=1 ttl=58 time=312 ms
64 bytes from 2001:470:66:23::2: icmp_seq=2 ttl=58 time=310 ms
64 bytes from 2001:470:66:23::2: icmp_seq=3 ttl=58 time=310 ms
^C
--- 2001:470:66:23::2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 310.646/311.293/312.518/0.866 ms
shalom-ext: -grantma- [~/dms]
$

Telnet to the domain TCP port in both directions, and to the rsync port on the slave server from the master. This checks that IPSEC encryption is running.

From shalom-ext:

shalom-ext: -grantma- [~/dms]
$ telnet 2001:470:66:23::2 53
Trying 2001:470:66:23::2...
Connected to 2001:470:66:23::2.
Escape character is '^]'.
^]c
telnet> c
Connection closed.
shalom-ext: -grantma- [~/dms]
$ telnet 2001:470:66:23::2 rsync
Trying 2001:470:66:23::2...
Connected to 2001:470:66:23::2.
Escape character is '^]'.
@RSYNCD: 30.0
^]c
telnet> c
Connection closed.
shalom-ext: -grantma- [~/dms]
$

From dns-slave1:

grantma@dns-slave1:~$ telnet 2001:470:f012:2::2 53
Trying 2001:470:f012:2::2...
Connected to 2001:470:f012:2::2.
Escape character is '^]'.
^]c
telnet> c
Connection closed.
grantma@dns-slave1:~$

If the DNS server is a DR replica, also telnet to the rsync port in the other direction.
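The telnet checks can also be run non-interactively. The sketch below assumes an nc that supports IPv6 together with the -z and -w options (for example netcat-openbsd); the function names are hypothetical. Port 873 is the standard rsync port.

```shell
#!/bin/sh
# Hypothetical helpers for probing the DNS and rsync TCP ports.

# Probe a single TCP port without sending data; give up after 5 seconds.
port_open() {
    nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}

# Probe each listed port on a host, report the result per port, and
# return non-zero if any of them could not be reached.
check_dns_ports() {
    host="$1"
    shift
    status=0
    for port in "$@"; do
        if port_open "$host" "$port"; then
            echo "$host port $port: open"
        else
            echo "$host port $port: FAILED"
            status=1
        fi
    done
    return "$status"
}

# Example, from the master towards the slave (left commented out):
# check_dns_ports 2001:470:66:23::2 53 873
```

Run it from both ends, using the address of the opposite server each time.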

Recovery

For racoon and strongswan, if things are not working restart the IPSEC connection at both ends:

Note

For Strongswan, use ipsec up <connection-name> and ipsec down <connection-name>. ipsec status [<connection-name>] lists all connections, or queries the status of one.

racoon shalom-ext master:

shalom-ext: -root- [/home/grantma/dms]
# racoon-tool vlist
shalom-dr
dns-slave1
%anonymous
shalom-ext
shalom
dns-slave0
en-gedi-auth
shalom-ext: -root- [/home/grantma/dms]
# racoon-tool vreload dns-slave1
Reloading VPN dns-slave1...The result of line 2: No entry.
The result of line 5: No entry.
done.
shalom-ext: -root- [/home/grantma/dms]
#

racoon dns-slave1:

root@dns-slave1:/home/grantma# racoon-tool vlist
shalom-dr
%anonymous
shalom-ext
root@dns-slave1:/home/grantma# racoon-tool vreload shalom-ext
Reloading VPN shalom-ext...The result of line 2: No entry.
The result of line 5: No entry.
done.
root@dns-slave1:/home/grantma#

Note

Wait 10 minutes for IPSEC replay timing to expire. Then retry the telnet steps above.

If IPSEC still will not work:

For racoon, use racoon-tool restart on both ends. For strongswan, use ipsec restart on both ends.

shalom-ext:

shalom-ext: -root- [/home/grantma/dms]
# racoon-tool restart
Stopping IKE (ISAKMP/Oakley) server: racoon.
Flushing SAD and SPD...
SAD and SPD flushed.
Unloading IPSEC/crypto modules...
IPSEC/crypto modules unloaded.
Loading IPSEC/crypto modules...
IPSEC/crypto modules loaded.
Flushing SAD and SPD...
SAD and SPD flushed.
Loading SAD and SPD...
SAD and SPD loaded.
Configuring racoon...done.
Starting IKE (ISAKMP/Oakley) server: racoon.
shalom-ext: -root- [/home/grantma/dms]
#

dns-slave1:

root@dns-slave1:/home/grantma# racoon-tool restart
Stopping IKE (ISAKMP/Oakley) server: racoon.
Flushing SAD and SPD...
SAD and SPD flushed.
Unloading IPSEC/crypto modules...
IPSEC/crypto modules unloaded.
Loading IPSEC/crypto modules...
IPSEC/crypto modules loaded.
Flushing SAD and SPD...
SAD and SPD flushed.
Loading SAD and SPD...
SAD and SPD loaded.
Configuring racoon...done.
Starting IKE (ISAKMP/Oakley) server: racoon.
root@dns-slave1:/home/grantma#

Note

Wait 10 minutes for IPSEC replay timing to expire. Then retry the telnet steps above.
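Rather than waiting a fixed time and retrying by hand, the retry can be polled. This is only a sketch: retry_check is a hypothetical helper, and the polling interval and the nc probe in the example are assumptions, not part of DMS.

```shell
#!/bin/sh
# Hypothetical helper: run a check command repeatedly until it succeeds.
# Usage: retry_check <attempts> <delay-seconds> <command> [args...]
retry_check() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        # Stop as soon as the check passes.
        "$@" && return 0
        # Sleep between attempts, but not after the last one.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}

# Example: after a reload or restart on both ends, probe the slave's DNS
# port once a minute for about ten minutes (left commented out):
# retry_check 10 60 nc -z -w 5 2001:470:66:23::2 53 || echo "IPSEC still down"
```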

DMS Master Server Install

Base Operating System: Debian Wheezy or later.

Create /etc/apt/apt.conf.d/00local.conf:

// No point in installing a lot of fat on VM servers
APT::Install-Recommends "0";
APT::Install-Suggests "0";
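If the file is to be created from a script rather than an editor, a here-document works. The function name and its optional root-directory argument are hypothetical, added so the sketch can be tried against a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write the APT settings shown above.
# With no argument it writes under /; pass a directory to stage it elsewhere.
write_apt_local_conf() {
    root="${1:-}"
    mkdir -p "$root/etc/apt/apt.conf.d"
    cat > "$root/etc/apt/apt.conf.d/00local.conf" <<'EOF'
// No point in installing a lot of fat on VM servers
APT::Install-Recommends "0";
APT::Install-Suggests "0";
EOF
}

# Example (run as root, left commented out):
# write_apt_local_conf
```

Afterwards, apt-config dump should list both Install-Recommends and Install-Suggests as "0".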

Create /etc/apt/sources.list.d/00local.list (APT ignores files in this directory whose names do not end in .list):

deb http://deb-repo.devel.net.nz/debian/ wheezy main
deb-src http://deb-repo.devel.net.nz/debian/ wheezy main

Install these packages:

# apt-get install cron-apt screen tree procps psmisc sysstat sudo lsof open-vm-tools open-vm-dkms dms

If using netscript-2.4 instead of ifupdown, ifupdown has to be purged for netscript-2.4 to install properly, because of cyclic boot dependencies (I will look into this when I have some spare time, and log an RC Debian bug):

# dpkg --force --purge ifupdown
# apt-get -f install

Further, for netscript-2.4, edit /etc/netscript/network.conf to configure static addressing. Look for IF_AUTO, set eth0_IPADDR, and further down comment out eth_start and eth_stop functions to turn off DHCP.

Note

For most setups, netscript-ipfilter is a suitable package for managing Linux filtering rules without replacing ifupdown.

Netscript-2.4 and netscript-ipfilter manage iptables and ip6tables via iptables-save/iptables-restore, and keep a cyclic history of saved rule sets. If your filter changes go wrong, you can change back to an earlier rule set via netscript ipfilter/ip6filter save/usebackup.

Then:

# aptitude update
# aptitude upgrade

shell.tar.gz

Note

This is just a personal Debian prompt thing of mine. You might say I get too personal…

To fix the shell prompt for larger terminals on the master server, which makes typing long zone_tool commands at the shell a lot clearer:

# tar -C / -xzf shell.tar.gz

This replaces the shell dot files in /etc/skel and /root with files containing a single line feed, to force use of the files in /etc.

Then edit /etc/environment.sh to turn off various things, like umask 00002 for user IDs less than 1000.

Completing DMS Setup

Then follow the Debian Install documentation.