Network Working Group                                         S. Leonard
Internet-Draft                                             Penango, Inc.
Intended Status: Best Current Practice                    March 21, 2016
Expires: September 22, 2016                                             


                 Regular Expressions for Internet Mail
                     draft-seantek-mail-regexen-00
                                    
Abstract

   Internet Mail identifiers are used ubiquitously throughout computing
   systems as building blocks of online identity. Unfortunately,
   incomplete understandings of the syntaxes of these identifiers has
   led to interoperability problems and poor user experiences. Many
   users use specific characters in their addresses that are not
   properly accepted on various systems. This document prescribes
   normative regular expression (regex) patterns for all Internet-
   connected systems to use when validating or parsing Internet Mail
   identifiers, with special attention to regular expressions that work
   with popular languages and platforms.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute working
   documents as Internet-Drafts. The list of current Internet-Drafts is
   at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 22, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
 


Leonard                  Best Current Practice                  [Page 1]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
     1.2. Normative Effects . . . . . . . . . . . . . . . . . . . . .  4
     1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  History and Formal Models for Internet Mail Identifiers  . . .  6
     2.1. The Core History  . . . . . . . . . . . . . . . . . . . . .  6
     2.2. Multipurpose Internet Mail Extensions and . . . . . . . . .  9
     2.3. Email Address Internationalization  . . . . . . . . . . . .  9
     2.4. The Data Model  . . . . . . . . . . . . . . . . . . . . . .  9
       2.4.1 Email Address  . . . . . . . . . . . . . . . . . . . . . 10
       2.4.2 Message-ID . . . . . . . . . . . . . . . . . . . . . . . 10
     2.5. Equivalence and Comparison  . . . . . . . . . . . . . . . . 10
       2.5.1. Email Address . . . . . . . . . . . . . . . . . . . . . 11
       2.5.2. Message-ID  . . . . . . . . . . . . . . . . . . . . . . 12
   3.  Regular Expressions for Email Addresses  . . . . . . . . . . . 13
     3.1. Deliverable Email Address . . . . . . . . . . . . . . . . . 13
       3.1.1. Basic Rules of Derivation . . . . . . . . . . . . . . . 13
       3.1.2. Complete Expression for Deliverable Email Address . . . 15
       3.1.3. Using Character Classes . . . . . . . . . . . . . . . . 15
       3.1.4. "Flotsam" and "Jetsam" Beyond ASCII . . . . . . . . . . 15
       3.1.3. Certain Expressions for Restrictions  . . . . . . . . . 16
       3.1.4. Unquoting Local-Part  . . . . . . . . . . . . . . . . . 17
       3.1.5. Quoting Local-Part  . . . . . . . . . . . . . . . . . . 17
     3.2. Modern Email Address  . . . . . . . . . . . . . . . . . . . 18
     3.3. Legacy Email Address  . . . . . . . . . . . . . . . . . . . 18
     3.4. Algorithms for Detecting Email Addresses  . . . . . . . . . 18
     3.5. Handling Domain Names . . . . . . . . . . . . . . . . . . . 18
   4.  Regular Expressions for Message-IDs  . . . . . . . . . . . . . 19
   5.  Security Considerations  . . . . . . . . . . . . . . . . . . . 19
   8.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 20
   9.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 20
     9.1.  Normative References . . . . . . . . . . . . . . . . . . . 20
     9.2.  Informative References . . . . . . . . . . . . . . . . . . 23
   Appendix A. Test Vectors . . . . . . . . . . . . . . . . . . . . . 24
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 24



1.  Introduction

   Internet Mail is everywhere. This fact of modern connected life is so
 


Leonard                  Best Current Practice                  [Page 2]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   self-evident that [RFC5598] says: "In practical terms, an email
   address string has become the common identifier for representing
   online identity." [MTECHDEV] acknowledges that email "has been one of
   the major technical and sociological developments of the past 40
   years." Whether it is joining a social network, participating in a
   forum, blogging, paying taxes, buying products, conducting
   professional correspondence, or communicating with loved ones, one's
   email address forms the cornerstone or backstop (frequently both) to
   these methods of communication. Internet mail is not only ubiquitous:
   it is essentially free for all users connected to the Internet.

   Yet it is surprising how fragile or cavalier many systems are with
   their treatment of Internet Mail identifiers, namely with email
   addresses. Prominent government agencies, financial institutions, and
   even major mail services reject a variety of forms that are in wide
   use by many user communities. [[NB: do a survey.]] For example, in
   the intervening time between IETF 94 and the submission of this
   Internet-Draft, the author interacted with not less than 25 different
   web or other Internet-connected services that rejected or mangled his
   perfectly valid email addresses. The result is a pernicious and
   creeping degradation of mail service and of the usability of the
   Internet Mail infrastructure, resulting in undelivered mail,
   misdelivered mail (which can constitute a security vulnerability),
   and denial of service.

   The Internet Mail standards, like the mail system, have evolved over
   time and have been modified to accommodate volumes and scenarios far
   beyond their original design goals. Furthermore, some identifier
   forms have been restricted over time as certain syntaxes were
   determined to be harmful, arcane, or just plain useless. [[So while
   not "blame", some responsibility or causation lies with these
   standards, which go out of their way to balance backwards-
   compatibility, complexity, completeness, and flexibility at the
   expense of a simple and widely-implementable addressing format.]]

   This document prescribes normative regular expression (regex)
   patterns for all Internet-connected systems to use when validating or
   parsing Internet Mail identifiers. Attention specifically focuses on
   "email address" (the specification for the string commonly associated
   with a single mailbox at a single named entity), and Message-ID,
   which share nearly identical syntax, but have different use cases and
   semantics. First, the history of Internet Mail is traced to build a
   coherent data model for Internet Mail identifiers. Second, relevant
   expression formats are discussed. Third, expressions to fit the
   identifiers in a variety of computing contexts are developed and
   presented. The overall goal of this document is to establish cut-and-
   dried algorithms that developers can incorporate directly into their
   mail-using products (including web browsers, form validators, and
 


Leonard                  Best Current Practice                  [Page 3]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   software libraries), replacing current ad-hoc (and oftentimes
   atrociously inconsistent) approaches with standardized behavior.

1.2. Normative Effects

   This document is proposed as a Best Current Practice [RFC1818] for
   all users and systems of Internet Mail identifiers, which basically
   means everyone connected to the Internet, other than the mail
   infrastructure itself. The Internet Mail infrastructure has been
   standardized (and continues to be standardized by) other documents,
   most notably [RFC5321] and [RFC5322]. Therefore, implementers
   developing mail systems MUST rely on those standards when building
   interoperable mail systems. At the same time, the text of this
   specification has been [[NB: will be]] carefully vetted by [[the
   IETF]] so that implementers SHALL be able to rely on it as a
   normative reference. Whether designing a new standard or implementing
   a new system that uses Internet Mail identifiers for some other
   purpose (e.g., as usernames, security principals, or keys in a
   database), relying parties can "copy-and-paste" the expressions in
   this document to parse, validate, compose, and process Internet Mail
   identifiers, rather than relying on homegrown solutions.

   Internet Mail has evolved over forty years, and will undoubtedly
   continue to evolve over time. This document does not constrain that
   development process. Actually, it is expected that expressions in
   this document will be updated to match changes in Internet Mail.

1.3. Definitions

   The terms "email address" and "address" (without qualification) refer
   to the string commonly associated with a single mailbox at a single
   named entity. In this document, the prose text always qualifies these
   terms with the source document when using a different sense.

   The term "Message-ID" (without qualification) refers to the globally
   unique string comprised of a left part, the at-sign (@), and a right-
   part, that is used to identify a single message. In this document,
   the prose-text always qualifies this term when using a different
   sense. The term "Message-ID field", for example, refers to the
   Message-ID Header Field, which includes the characters "Message-ID:"
   and the surrounding "<" and ">" angle brackets.

   Unquoted all-capital symbols in prose text have the meanings
   specified in [ABNFMORE].

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
 


Leonard                  Best Current Practice                  [Page 4]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   [RFC2119].

   This document provides expressions that can be used directly in
   popular computing platforms. An important subset of email address
   syntaxes, namely deliverable email addresses, can be described in a
   regular language [[CITE]]. A regular language is a language
   recognized by (and computable with) a finite automaton. [[CITE:
   Kleene's Theorem]] However, the full syntax of email addresses
   requires a context-free language (i.e., governed by ABNF) and a
   pushdown automaton.

   The term "regular expression" is a sequence of characters that define
   a search pattern for string matching. Originally the term referred to
   expressions that described regular languages in formal language
   theory. Formal regular expressions are limited to:
      o empty set and empty string
      o literal character
      o concatenation
      o alternation
      o repetition (Kleene star)

   Modern-day libraries support expressions that far exceed the regular
   languages. Some libraries even support capabilities that exceed the
   context-free languages. However, this document limits itself to truly
   regular grammars where possible, and where not possible, to context-
   free grammars. Implementers can therefore implement (or compile)
   these specifications on computing-constrained devices.

   The regular expressions in this document are intended to conform to
   the following de-jure or de-facto standards. Where expressions are
   given, they are annotated with single characters that refer to the
   standards to which they conform. [[NB: or, are intended to conform
   to, after further development.]]

      1 4       Title                                      Ref
      -+-------+------------------------------------------+---------
      P PCRE:   Perl Compatible Regular Expressions        [PCRE838]
                (version 8.38, November 23, 2015)
      E (P)ERE: POSIX Extended Regular Expressions         [POS-ERE]
                (POSIX/IEEE Std 1003.1, Issue 7, 2013 Ed.)
                [[NB: could be ERE, PERE, or P-ERE]]
      J JSRE:   JavaScript Regular Expressions             [JSRE6ED]
                (ECMAScript/ECMA-262, 6th Ed., 2015)

   Implementers should exercising caution when using a library that
   claims to be "Perl-Compatible" without actually being the bona-fide
   PCRE library: it may exhibit different or incomplete behavior.
   Implementers should also note that ERE and JSRE are fully implemented
 


Leonard                  Best Current Practice                  [Page 5]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   as alternative grammars in the std::regex library of C++11 and its
   successor, C++14. [[TODO: cite C++ standards.]] In the absence of a
   "live" regular expression library, the expressions in this document
   are easily compiled into automata (i.e., target language code) using
   well-studied algorithms.

   Surrounding delimiters (i.e., slashes) are omitted unless relevant to
   the proffered usage.

2.  History and Formal Models for Internet Mail Identifiers

2.1. The Core History

   Internet Mail (also known as electronic mail, e-mail, email, or
   simply "mail") is an asynchronous, store-and-forward method of
   exchanging digital messages from an author to one or more recipients.
   [MTECHDEV] recounts the technical development of email, which is too
   voluminous to be repeated here. This section specifically focuses on
   the history of the identifiers used in email.

   When people think of an email address, <user@example.com> is the
   quintessential example. However, addresses did not always look this
   way. Electronic mailing systems come from the earliest days of
   networking, before ARPANET. The specification for identifiers was
   defined by the networking project on a more-or-less ad-hoc basis.
   When ARPANET began, mail and file transfer were seen as important
   founding services. In fact, FTP was a significant transport mechanism
   for mail.

   By 1973, users of ARPANET came together to propose [RFC524] and
   standardize [RFC561] parts of the mail system. In these early
   specifications, the at-sign "@" was not used; instead, the term "AT"
   separated the left-side (user production) from the right-side (host
   production). Tokens (the word production, specifically) during that
   time were separated by SP, CR, and LF. The word production was
   defined at that time to be any ASCII character other than CR, LF, and
   SP. [RFC524] only standardized the From header, not any recipient
   headers.

   In 1975, [RFC680] proposed a more formal format for for message
   fields beyond [RFC561], including "receiver specification fields"
   (TO, CC, and BCC) and "reference specification fields" (MESSAGE-ID,
   IN-REPLY-TO, REFERENCES, and KEYWORDS) for the first time. The
   receiver fields used mailbox productions with user and host parts
   separated by "@" (without spaces), in contrast to the originator
   specification fields (specifically FROM and SENDER), which continued
   to use "AT" (with single spaces on either side). An [RFC680] Message-
   ID was structured with a "Net Address" (presumably, a host name) in
 


Leonard                  Best Current Practice                  [Page 6]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   square brackets, followed by a line production (any characters other
   than CR or LF).

   The first real email standards that resemble modern usage were
   published in 1977: [RFC724] and its revision, [RFC733]. Those
   documents are historic and their formats are incompatible with most
   modern mail systems, but understanding them provide important
   insights into the structure of identifiers starting with [RFC821] and
   [RFC822]. The "@" character gained greater prominence, although "AT"
   (then lowercased to "at" in the specifications) was still supported.
   Importantly, [RFC733] Message-ID was standardized to match the
   [RFC733] email address format, and a uniqueness guarantee was added:
   "The uniqueness of the message identifier is guaranteed by the host
   which generates it." Essentially, this text implies that the right-
   hand part of the Message-ID is to be a hostname, or operatively
   associated with a hostname (i.e., not just a random string, but a
   unique--possibly random--string assigned by a host). At this point,
   Message-ID and email address specifications converged in the
   host-phrase production.

   [RFC821] and [RFC822] are the foundational RFCs for modern mail
   usage, and their revisions over the years retain the same basic
   structure and division between mail transfer (specifically, SMTP) and
   mail format. Jon Postel's invention of SMTP grew out of experience
   and disenchantment with the Mail Transfer Protocol (MTP) [RFC772].
   The Simple Mail Transfer Protocol is indeed simple: it is structured
   like FTP but only has a limited set of commands: HELO, MAIL FROM,
   RCPT TO, DATA, QUIT, VRFY, and EXPN (and a few others not relevant
   for this discussion). The commands can take at most one argument.
   [RFC821] describes a forward-path argument rather than an email
   address argument: the distinction is that after the username, a
   series of "@" host specifications designate the hops through which
   the message is supposed to travel before it reaches its destination.

   The main [RFC821] production for an email address is the <mailbox>,
   which is defined as <local-part> "@" <domain>. The <local-part>
   production can be a <dot-string> or a <quoted-string>. In both cases,
   the full range of ASCII characters is actually permitted, although
   different characters must be backslash-escaped in different
   productions. The <domain> production is extremely broad and can
   "stack" domain-element components with the period "."; a domain-
   element component can be a hostname, "#" and a number, or an IPv4
   address enclosed in square brackets "[" and "]".

   [RFC822] introduced the "addr-spec" ABNF production, which is that
   series' term for an email address. While a [RFC822] route-addr
   production can include a source route (aka forward-path with multiple
   hosts), addr-spec is noted to be global address, with the right side
 


Leonard                  Best Current Practice                  [Page 7]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   being a "domain" production. This definition presaged the first DNS
   standards [RFC882] and [RFC883], although it was clearly designed
   with DNS in mind. The [RFC822] ABNF permits a domain to include
   multiple domain-literal productions (i.e., bracketed) separated by
   "."; however, the accompanying text basically obviates such
   productions. As [RFC822] presaged the widespread implementation of
   DNS, various systems would spread routing information between the
   local-part and domain productions (see Section 6.2.5). [RFC822]
   discusses local-part syntax extensively, including examples of
   comment productions that are supposed to be ignored semantically (see
   Section A.1).

   Section 3.4.7 of [RFC822] describes the constituent components of an
   address as requiring "preservation of case information", which is
   slightly different than saying "case-sensitive" outright (although
   the latter is strongly implied). The main historical point to glean
   is that intermediate mail systems were supposed to transit the local-
   part AS-IS without modification, so that the destination system--and
   only the destination system--would parse it.

   [RFC822] assigns specific semantics to Message-ID but is light on
   syntax: the msg-id production is just addr-spec enclosed in mandatory
   "<" and ">".

   The widespread deployment of DNS [RFC973] (later [RFC1034] and
   [RFC1035], with [RFC1123] relaxation) dramatically changed the
   Internet messaging landscape of the late 1980s. [[TODO: complete.]]
   With respect to email addresses and Message-IDs, it became obvious
   that the right-hand side of the "@" was supposed to represent a
   fully-qualified domain name at which Mail eXchanger records are
   located [[TODO: cite]]. Other networks and host-naming formats became
   obsolete by the mid-1990s.

   The 2001 standards [RFC2821] and [RFC2822] reflect the explosive
   growth and diverse deployments of Internet mail. The work was
   undertaken in the DRUMS (Detailed Revision/Update of Message
   Standards) working group between 1995 and 2001. [RFC2822] introduced
   the "obs-" prefix in its ABNF, along with Section 4, Obsolete Syntax.
   Essentially, [RFC2822] prescribes a generation format that is much
   stricter than the parsing format, but still demands that conforming
   implementations understand the parsing format. Therefore, the U+0000
   NULL character that is considered obsolete in [RFC2822] must still be
   considered part of the data model of email addresses. The syntax of
   [RFC2821] is tighter than [RFC2822], and a careful read makes it
   apparent that the underlying address formats diverged in the
   intervening years. Specifically, [RFC2822] is much more concerned
   with historical forms than [RFC2821], which is more about
   contemporary transmission behavior. At the same time, [RFC2821] does
 


Leonard                  Best Current Practice                  [Page 8]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   not actually prohibit a wide range of C0 control characters, which
   still remain part of [RFC2821]'s data model. [[TODO: complete history
   of RFC 282x.]]

   Revisions to the base mail standards most recently completed,
   [RFC5321] and [RFC5322], were worked on between 2005 and 2008. Email
   addresses saw further character restrictions, namely around the
   entire range of C0 control characters, which [RFC5321] explicitly
   prohibits. [[TODO: complete history of RFC 532x.]]


2.2. Multipurpose Internet Mail Extensions and
     Uniform Resource Identifiers

   [[Message-ID and Content-ID. Therefore this document applies with
   equal normative force to Content-IDs, and to mid: and cid: URIs that
   use them.]]

2.3. Email Address Internationalization

   As the Internet became a ubiquitous feature of modern life, and as
   email followed it, users in various countries called for identifiers
   that were usable in their native languages. The IETF [[worked on
   this]] in the email address internationalization (EAI) effort,
   culminating in [RFC6530], [RFC6531], and [RFC6532]. The key changes
   for identifiers were to expand the character repertoire of email
   addresses, and anything looking like email addresses (e.g., Message-
   ID), to include UTF-8 octets. As UTF-8 can encode any Unicode scalar
   value (but not any Unicode code point), the practical result is that
   addresses and Message-IDs can contain (almost) any Unicode scalar
   value.

   A practical result of EAI combined with MIME and MIME URIs, is that
   MIME URIs now are UTF-8 encoded in the pct-encoded production.

2.4. The Data Model

   Parties that rely on this document SHALL interpret the semantically
   meaningful parts of Internet Mail identifiers as follows.









 


Leonard                  Best Current Practice                  [Page 9]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


2.4.1 Email Address

   An email address is comprised of a local-part and a domain, separated
   by "@". The parsed local-part is a sequence of Unicode scalar values,
   and can be represented by any well-formed Unicode encoding. [[NB: The
   following may be CONTROVERSIAL!!]] The parsed domain is a sequence of
   restricted Unicode scalar values that represent some identifier for
   some host on the network. The parsed domain can be:

   1. a hostname of any kind, including (but not limited to)
      a DNS name;
   2. an IPv4 address literal;
   3. an IPv6 address literal;
   4. an address literal, comprised of a standardized tag
      and a sequence of ASCII characters.

   Conveniently, the domain string subtypes can be combined into a
   single well-formed Unicode string, discriminated as follows:

   3. If the string begins with "IPv6:", it is a type 3 IPv6 address,
      and the remainder had better be a valid IPv6 address in
      textual form.
   4. If the string begins with Ldh-str and a colon ":", it is a
      type 4 address literal, and the remainder had better be
      dcontent (which notably is not supposed to contain characters
      beyond ASCII).
   2. If the string has four sets of digits 0-255
      separated by dots, then it is an IPv4 address.
   1. Otherwise, it had better be a domain name (i.e., comprised of
      NR-LDH labels and U-labels, separated by dots).
      [[NB: [RFC1912] says a label can't be all-numeric, but
      then it catalogs some exceptions.]]
   0. Finally, it is "some random Unicode string" that is
      syntactically valid under the most expansive rules,
      but is not useful for delivering or reporting on Internet mail.

2.4.2 Message-ID

   A Message-ID is comprised of a left part (id-left) and a right part
   (id-right), separated by "@". The id-left is a sequence of Unicode
   scalar values, and can be represented by any well-formed Unicode
   encoding. [[NB: The following may be CONTROVERSIAL!!]] The id-right
   is a sequence of Unicode scalar values, and can be represented by ay
   well-formed Unicode encoding.

   (A Content-ID [RFC2045] has the same composition as a Message-ID.)

2.5. Equivalence and Comparison
 


Leonard                  Best Current Practice                 [Page 10]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


2.5.1. Email Address

   Two email addresses are equivalent if the parsed local-part and
   parsed domain values are equivalent.

   Two parsed local-parts are equivalent if their Unicode scalar values
   are equal.

   The special values "postmaster", "abuse", and [TODO: fill out] SHALL
   be compared case-insensitively [TODO: full-width characters? Unicode
   case folding? etc.].

   [Case sensitivity] In all other cases, two parsed local-parts may be
   equivalent if the [TODO: receiving MTA] delivers mail addressed to
   them to the same mailbox. There is no algorithmic comparison to
   determine said equivalence.

   Two parsed domains are equivalent if both have the same type, and the
   values are equivalent. Additionally, an IPv6 address literal is equal
   to an IPv4 address literal when the IPv6 address is an "IPv4-mapped
   IPv6 address", and its IPv4 component equals the IPv4 address of the
   IPv4 address literal.

   1. hostnames (domain names): equal [TODO: RFC-REF, IDNA,
      RFC1034, RFC1035, etc.].
   2. IPv4 addresses: equal [TODO: RFC-REF].
   3. IPv6 addresses: equal [TODO: RFC-REF].
   4. address literal: the standardized tag is equal (case-insensitive),
      and the address value is octet-for-octet equal (case-sensitive),
      or is equal per the rules standardized by the standardized
      tag registration.

2.5.1.1 Case sensitivity of local-part

   Of all equivalence issues, no issue generates more confusion and
   dissent in the email community than the case sensitivity of the
   local-part. Formally local-part is case-sensitive. A significant
   fraction of installed mail servers treat local-part as case-
   insensitive in the ASCII range. (At the time of this writing, EAI has
   not been implemented widely enough to make statements about case
   insensitivity for characters beyond ASCII.) [[TODO: survey and
   statistics.]] Furthermore, many systems [[TODO: quantify]] outside of
   the Internet Mail infrastructure compare the local-part of email
   addresses case-insensitively.

   Historically, local-parts were case-sensitive because of MULTICS.
   However, as time went on they became case-preserving: receiving
   systems would not register additional mailbox names (i.e., local-
 


Leonard                  Best Current Practice                 [Page 11]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   parts) if the proposed mailbox name differed from an existing mailbox
   name only by case.

   [[TODO: develop recommendation about how wise or unwise it is to go
   one way or another.]] One possible approach is to define an
   additional output, "conditionally equivalent" of the equivalence
   algorithm. Therefore an implementation conforming to this document
   SHALL output one of three states: equivalent, not equivalent, and
   "conditionally equivalent", that is, equivalent if and only if local-
   part is compared case-insensitively in the ASCII range [[TODO: should
   we say in the full Unicode range?]]. Applications implementing this
   algorithm SHOULD NOT treat such a state as "equivalent". For example,
   a user-facing application SHOULD treat this state as a "warning" that
   requires further intervention.

   In modern times, email addresses tend to be emitted [[TODO:
   statistics]] in all-lowercase, when case is normalized. Therefore,
   applications implementing this algorithm [[TODO: document]] that are
   aware that the receiving MTA is case-insensitive, as well as
   applications implementing this algorithm that receive input that is
   case-ambiguous (such as voice input), SHOULD record the local-part in
   all-lowercase unless presented with evidence to the contrary.

   The EAI standards (RFCs 6530-6532) make no mention of case
   sensitivity issues for characters beyond the ASCII range. Permitting
   Unicode scalar values (i.e., UTF-8) opens up a whole range of
   comparison issues with potentially far-reaching identity and security
   implications. [[TODO: discuss case preservation and sensitivity
   issues in characters beyond the ASCII range. The bottom line is, use
   what you're given, don't mess with it, hope for the best.]]

2.5.2. Message-ID

   Two Message-IDs are equivalent if the parsed id-left and parsed id-
   right values are equivalent.

   Two parsed id-left values are equivalent if their Unicode scalar
   values are equal.

   [[NB: controversial!]] Two parsed id-right values are equivalent if
   their Unicode scalar values are equal.

   [[TODO: The case sensitivity of id-right has not been fully explored
   in any standard to-date. To the extent that id-right represents a
   domain name, there is a strong argument to treat id-right as case-
   insensitive in the ASCII range. Standardized tags are probably case-
   insensitive based on the ABNF of [RFC5321] relating to "IPv6". The
   rest is kind of up for grabs. The bottom line is, if you intend to
 


Leonard                  Best Current Practice                 [Page 12]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   match the Message-ID of an existing message, don't take chances: just
   copy it verbatim into the destination.]]

   (Two Content-IDs are equivalent under the same rules. However, a
   Content-ID and a Message-ID are never equal to each other, and if
   such a thing occurs, it is not correct because both Content-ID and
   Message-ID are supposed to be "world-unique" [RFC2045] [RFC5322].)

3.  Regular Expressions for Email Addresses

   [[Valid email address vs. deliverable email address]] A "deliverable
   email address" complies with the modern production rules of [RFC5321]
   and [RFC6531]. A deliverable email address SHALL have a domain part
   that is a domain name (Section 2.3.5 of [RFC5321]); it SHALL NOT have
   a domain part that is an address literal (Section 4.1.3 of [RFC5321])
   or a host name that does not comply with domain name rules (see
   [RFC1034], [RFC1035], [RFC5890] et. seq.). [[TODO: justify. Basically
   experiments have borne out that many mail systems will not accept
   user@[3.3.3.3] as a RCPT TO. Technically it is valid per RFC 5321,
   but practically any receiving MTA that handles more than one
   MX/domain will have difficulty in figuring out what domain-specific
   mailbox to which to deliver the mail. The author tried a couple of
   popular MTAs.]] Systems that use email addresses with an expectation
   of SMTP delivery SHALL accept productions that comply with this
   document.

   Email addresses that do not meet these modern production rules may
   nevertheless be valid under the other modern (e.g., [RFC5322]) or
   legacy (e.g., [RFC821] and [RFC822]) production rules.

   Systems that recognize "modern email addresses" in new corpa (e.g.,
   in text editors) SHALL accept productions that comply with this
   document.

   Systems that recognize "legacy email addresses" in existing corpa
   (e.g., in email messages or documents predating this document) SHALL
   accept productions that comply with this document.

3.1. Deliverable Email Address

   A deliverable email address is an email address that can be used to
   deliver messages over the modern SMTP infrastructure. This has
   important implications for the domain part, because the domain part
   MUST represent a domain name that complies with contemporary rules
   and regulations, such as [RFC5890].

3.1.1. Basic Rules of Derivation

 


Leonard                  Best Current Practice                 [Page 13]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   The following rules are amalgamated from the SMTP standards [RFC5321]
   and [RFC6531], the foundational DNS standards [RFC1034], [RFC1035],
   and [RFC1123], and the modern IDNA standards [RFC5890], [RFC5891],
   and [RFC5892].

   [[NB: The syntax below assumes that Perl Compatible Regular
   Expressions [PCRE838] is being used, such that \xnn and \x{...} refer
   to valid Unicode scalar values, i.e., well-formed UTF-8 sequences.
   Specifically, the surrogate range U+D800-U+DFFF is omitted. Although
   listed as JSRE and ERE-compatible, these expressions will need to be
   massaged somewhat to handle the Unicode-referencing differences.]]

   [[TODO: There need to be two forms: a form that matches a complete
   string, so the regular expression can start with ^ and end with $.
   This will make the execution pretty fast. A second form can match any
   string in a block of text. This will be much more intensive because ^
   and $ cannot be used; instead the boundary could be delimited by any
   of a HUGE yet noncontiguous quantity of characters beyond ASCII, such
   as fullwidth punctuation, spaces of various kinds, etc.]]

P E J atext = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]

P E J dot-string = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
                   (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}])

P E J qtext = [ !#-\[\]-~\xA0-\x{10FFFF}]

P E J quoted-pair = \\[ -~]

P E J qcontent = [ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~]

P E J quoted-string = "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*"

P E J local-part = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
             (?:\.[A-Za-z0-9!#-'*+-/=?^_`{-~\xA0-\x{10FFFF}])|
             "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*"

; RFC 5890, must contain at least one non-ASCII character
; TODO: express other constraints such as in Protocol document [RFC5891]
; and Tables document [RFC5892]
; WARNING: intentional omission: dot . is included in this production--
; it should not be
      U-label = [[TOO COMPLEX TO DO RIGHT NOW...]]
P E J U-label = [x00-x{10FFFF}]*[x80-x{10FFFF}]+[x00-x{10FFFF}]*

P E J sub-domain = [A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
                   [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+
                   [\x00-\x{10FFFF}]*
 


Leonard                  Best Current Practice                 [Page 14]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


P E J domain = (?:[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
               [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*)
               (?:\.(?:[A-Za-z0-9](?:[A-Za-z0-9\-]*
               [A-Za-z0-9])?|[\x00-\x{10FFFF}]*
               [\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*))*

3.1.2. Complete Expression for Deliverable Email Address

   The following regular expression is a deliverable email address:

; Mailbox from RFC 5321, as amended
P E J DEA     = ([A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
                 (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}])|
                 "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*")@
                ((?:[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
                 [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*)
                 (?:\.(?:[A-Za-z0-9](?:[A-Za-z0-9\-]*
                 [A-Za-z0-9])?|[\x00-\x{10FFFF}]*
                 [\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*))*)

   In the regular expression DEA, capturing group 1 is the local-part
   production, and capturing group 2 is the domain production.

3.1.3. Using Character Classes

   [[TODO: provide expressions that use character classes, and explain
   the benefits and tradeoffs.]]

3.1.4. "Flotsam" and "Jetsam" Beyond ASCII

   As mail usage is international in scope, modern mail and mail
   identifier-using systems MUST support Unicode EAI identifiers.
   Unfortunately, rigorously following the EAI specifications [RFC6530],
   [RFC6531], and [RFC6532] will lead to (possibly) unforeseen text
   parsing problems, where naive (or strictly conforming) parsers will
   tend both to overconsume and underconsume non-ASCII text surrounding
   an otherwise "obvious" e-mail address. The problem is described in
   this section, while the following Section 3.1.3 provides at least
   some partial mitigations.

   The right-hand side of a deliverable email address is a domain name.
   A conforming parser may well overconsume text on the right-hand side,
   aka "jetsam", that cannot possibly be in a domain name, such as non-
   ASCII punctuation, spaces, control characters, and noncharacter code
   points. On the left-hand side (local-part), for better or for worse,
   the atext production of local-part has been extended in both
   [RFC6531] and [RFC6532] to accept any Unicode character beyond the
   ASCII range. Therefore, a whole slew of "flotsam" can get validly
 


Leonard                  Best Current Practice                 [Page 15]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   prepended to an internationalized email address. Apparently the only
   way to "stop" this is to quote the local-part.

   Characters will get inconsistent treatment depending on which end the
   characters appear. For example: the characters U+FF1C FULLWIDTH LESS-
   THAN SIGN and U+FF1E FULLWIDTH GREATER-THAN SIGN are classified as
   [Sm] aka "Symbol, Math", which are DISALLOWED under [RFC5892].
   Therefore the presence of U+FF1E on the domain end will terminate the
   address (for a properly implemented regular expression), but the
   corresponding presence of U+FF1C on the local-part end will NOT
   terminate the address! A conforming parser would actually start much
   earlier in a blob of text (possibly as early as the beginning of a
   new line) and match all characters up to the @ delimiter, blowing
   straight through U+FF1C.

3.1.3. Certain Expressions for Restrictions

   Email addresses incorporate internationalized domain names, so the
   complex and confusing rules of IDNs apply directly to the right-hand
   side of deliverable email addresses.

   [[TODO: integrate these expressions into the main expressions of
   3.1.2.]] The following expressions apply to an individual sub-domain
   production:

   [[NB: open issue if length should be restricted. Author believes it
   should be length-restricted, because overlong labels in domain names
   mean the address can't be looked up, and therefore, the address is
   not deliverable.]]

   The following lookahead regular expressions apply on a per-label
   (i.e., per-sub-domain) basis.

P E J (?=[^.]{1,63}\.|$)
        restricts to 63 characters

P E J (?=[0-z]{1,63}\.|$)
        restricts to 63 LDH characters

P E J (?=[0-z\xA0-\x{10FFFF}]{1,56}\.|$)(?!..--)
      (?=[0-z\xA0-\x{10FFFF}]{0,55}[\xA0-\x{10FFFF}][0-z]{0,55}\.|$)
      or
      (?=[0-z\xA0-\x{10FFFF}]{1,56}\.|$)(?!..--)(?![0-z]{1,56}\.|$)
        Enforces that for U-labels (where at least one non-ASCII char
        is present), there cannot be more than 55 chars. The equivalent
        ACE will include xn-- PLUS -2rh, which means 7 extra characters
        (8 chars minus 1 Unicode char). Did a test with U+00DE.
        Also not allowed to have any any HYPHEN HYPHEN
 


Leonard                  Best Current Practice                 [Page 16]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


        (Section 4.2.3.1 of [RFC5891]).

[[NB: Should we permit fullwidth and other dots (not just ASCII dot)?]]

P E J (?![0-9]{1,63}\.|$)
        restricts out all-numeric labels [RFC1912]
        [[TODO: Would it be more accurate to say that the all-numeric
        labels 0-255 are prohibited, but 256+ are permissible?]]

[[NB: The end-of-address production \.|$ must also include the
possibility of any number of non-domain name characters, when searching
through an arbitrary block of text.]]

[[TODO: the flotsam on the local-part end is potentially a big problem,
unless we say that deliverable email addresses MUST be delimited on the
left-hand side by the ASCII character < or other well-established
characters that are not in dot-atom-text/dot-string.]]

3.1.4. Unquoting Local-Part

The following regular expressions can be used to unquote the local-part
production.

P sed s/^"(.*)"$/$1/
        Applies when the local-part is isolated from @domain.
        Removes surrounding quotations.

P sed s/\\(.)/$1/g
        Applies when the local-part is isolated from the surrounding
        quotations. (Safe to use with dot-string aka dot-atom-text,
        since backslashes are not present.) Unquotes quoted-pairs.

P   J /^"|(?=.+"$)\\(.)|"$/g  (with "$1" replacement string)
P sed s/^"|(?=.+"$)\\(.)|"$/$1/g
        Applies when local-part is isolated from @domain.
        Removes surrounding quotations and unquotes quoted-pairs
        in one loop by using lookahead.

   [[TODO: add more detailed descriptions of operations.]]

3.1.5. Quoting Local-Part

   Given a Unicode string that represents the unquoted local-part, the
   following regular expressions can be used to create a quoted
   production.

    J /(?=[^"\]*["\])(["\])/\$1/g
        Quote each and every " and \.
 


Leonard                  Best Current Practice                 [Page 17]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


    J /^(?![A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
      (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}])$).*$/"$&"/
        If the string does not conform to dot-string (including,
        e.g., the presence of consecutive dots), surround the
        entire string with quotations.

3.2. Modern Email Address

   [[TODO: expand regular expressions. This may include C0 and C1
   control characters because we are talking about RFC 822/2822/5322.
   This will definitely include domain-literals. The complex rules
   regarding U-labels might be ignorable.]]

   [[TODO: deal with FWS, CFWS. Note that CFWS requires a pushdown
   automaton, which a truly "regular" expression cannot handle. Need
   recursive subroutines such as (?1) syntax, or backreference (\1)
   syntax.]]

3.3. Legacy Email Address

   [[TODO: expand regular expressions. Arguably, characters beyond ASCII
   need not be included. Therefore domain should be MUCH simpler: the
   name form should be restricted to LDH-strings.]]

3.4. Algorithms for Detecting Email Addresses

   As Section 3.1 indicates, compiling and executing a true and correct
   regular expression for an email address (deliverable, valid,
   historic) will be complicated and time-consuming. More efficient
   algorithms are desirable.

   Scan text buffer for "@".
   Evaluate characters after "@" for domain production.
   Evaluate character prior to "@".
   If prior character is <">, scan backwards for initial (unescaped)
   <">; evaluate characters in between to match quoted-string
   production.
   Otherwise if prior character is a valid atext, consume characters
   backwards while evaluating for dot-string. (The dot-string production
   is palindromic.)

   [[TODO: add other suggested algorithms.]]

   [[Splitting valid email address into local-part and right-hand-
   side.]]

3.5. Handling Domain Names

 


Leonard                  Best Current Practice                 [Page 18]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   [[TODO: discuss the issues with handling domain names.]]

4.  Regular Expressions for Message-IDs

   Message-ID values form a disjoint set from email address values,
   i.e., a Message-ID that also happens to be an email address is just a
   coincidence.

   The productions that comprise Message-ID are called id-left and
   id-right.

   [[NB: This section remains to be written. What follows are
   pointers.]]

   A "modern Message-ID" is one that complies with the strict generation
   rules of [RFC5322]. In particular: id-left is only dot-atom-text (as
   amended by [RFC6532], and id-right is dot-atom-text or no-fold-
   literal (as amended by [RFC6532]). Notably, virtually any Unicode
   scalar value is permissible in id-right, because [RFC6532] does not
   import U-label (unlike [RFC6531]). The resulting regular expressions
   will therefore be more expansive, at the cost of accepting
   characters, such as fullwidth punctuation, that would otherwise
   delimit Message-IDs on both ends in text.

   A "general Message-ID" is one that complies with any of the mail
   rules.

5.  Security Considerations

   Internet Mail identifiers are important for identifying users and
   other principals in security systems. While a user's login token and
   an email address are formally separate entities, many common
   Internet-connected systems conflate the two. Systems that accept
   email addresses as login tokens (in particular, other systems' email
   addresses, rather than their own) SHALL accept the full range of
   valid email addresses. To do otherwise is to act as a denial of
   service against legitimate users with legitimate mailbox names.

   When a user forgets his or her password or other login credentials,
   the most common recovery method on the Internet is to send a recovery
   message to the user's registered [[email address]]. Preventing users
   from using their chosen or assigned addresses acts as a denial of
   service.

   Because a local-part can contain almost any Unicode scalar value,
   security issues are essentially pushed from clients to servers and to
   registration processes. I.e., it is up to a server implementation to
   decide whether to accept an arbitrary Unicode string for registration
 


Leonard                  Best Current Practice                 [Page 19]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   or for delivery with SMTP, folding or normalizing input at its
   discretion. A robust server implementation needs to handle arbitrary
   input gracefully.

   In contrast, when a domain part represents a domain name, the string
   is severely restricted by the IDNA documents [RFC5890] et. seq. A
   server is perfectly within its rights to reject input that is not in
   NFC or contains disallowed characters. Specifically, since the domain
   part is the key to retrieving the MX resource record, the Internet
   Mail standards hardly get involved. These restrictions put more onus
   on clients to validate strings. However, integrating the entire
   static list of [RFC5892] into regular expressions would be unduly
   burdensome on many implementations. While an implementation can
   consider using character classes, the risks and benefits of using
   character classes need to be carefully considered.

   Character classes represent a shorthand for certain ranges of
   characters based on Unicode properties. Effectively the total
   system's state table remains the same, but the complexity is pushed
   from one component (the regular expression definition) to another
   component (the table representing the character classes). Ultimately,
   the regular expression character classes in popular formulations will
   derive from the Unicode Standard [UNICODE] definitions such as
   UnicodeData.txt. Since a proper IDNA-enabled DNS library needs to
   keep track of these character classes anyway, referencing these
   ranges by character classes should not add much to the image size.
   However, now the regular expression (which previously was self-
   contained) now has an external dependency to data that may--and
   probably will--frequently change, including (for example) the ranges
   of unassigned code points. What could have been a valid email address
   one month may be invalid the next month, or vice-versa, simply
   depending on the version of the regular expression or DNS library
   that an implementation depends on. Implementers should therefore
   evaluate their own needs for security and stability in picking
   particular regular expression forms.

8.  Acknowledgements

   [[TODO: add acknowledgements]].

9.  References

9.1.  Normative References

   [ABNFMORE] Leonard, S., "Comprehensive Core Rules and References for
              ABNF", draft-seantek-abnf-more-core-rules-05 (work in
              progress), March 2016.

 


Leonard                  Best Current Practice                 [Page 20]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


   [RFC821]   Postel, J., "Simple Mail Transfer Protocol", STD 10, RFC
              821, DOI 10.17487/RFC0821, August 1982, <http://www.rfc-
              editor.org/info/rfc821>.

   [RFC822]   Crocker, D., "STANDARD FOR THE FORMAT OF ARPA INTERNET
              TEXT MESSAGES", STD 11, RFC 822, DOI 10.17487/RFC0822,
              August 1982, <http://www.rfc-editor.org/info/rfc822>.

   [RFC973]   Mockapetris, P., "Domain system changes and observations",
              RFC 973, DOI 10.17487/RFC0973, January 1986,
              <http://www.rfc-editor.org/info/rfc973>.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
              <http://www.rfc-editor.org/info/rfc1034>.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987, <http://www.rfc-editor.org/info/rfc1035>.

   [RFC1123]  Braden, R., Ed., "Requirements for Internet Hosts -
              Application and Support", STD 3, RFC 1123, DOI
              10.17487/RFC1123, October 1989, <http://www.rfc-
              editor.org/info/rfc1123>.

   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part One: Format of Internet Message
              Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996,
              <http://www.rfc-editor.org/info/rfc2045>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, DOI
              10.17487/RFC2119, March 1997, <http://www.rfc-
              editor.org/info/rfc2119>.

   [RFC2821]  Klensin, J., Ed., "Simple Mail Transfer Protocol", RFC
              2821, DOI 10.17487/RFC2821, April 2001, <http://www.rfc-
              editor.org/info/rfc2821>

   [RFC2822]  Resnick, P., Ed., "Internet Message Format", RFC 2822, DOI
              10.17487/RFC2822, April 2001, <http://www.rfc-
              editor.org/info/rfc2822>.

   [RFC5321]  Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
              DOI 10.17487/RFC5321, October 2008, <http://www.rfc-
              editor.org/info/rfc5321>.

   [RFC5322]  Resnick, P., Ed., "Internet Message Format", RFC 5322, DOI
 


Leonard                  Best Current Practice                 [Page 21]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


              10.17487/RFC5322, October 2008, <http://www.rfc-
              editor.org/info/rfc5322>.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, DOI 10.17487/RFC5890, August 2010,
              <http://www.rfc-editor.org/info/rfc5890>.

   [RFC5891]  Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", RFC 5891, DOI
              10.17487/RFC5891, August 2010, <http://www.rfc-
              editor.org/info/rfc5891>.

   [RFC5892]  Faltstrom, P., Ed., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, DOI 10.17487/RFC5892, August 2010,
              <http://www.rfc-editor.org/info/rfc5892>.

   [RFC6530]  Klensin, J. and Y. Ko, "Overview and Framework for
              Internationalized Email", RFC 6530, DOI 10.17487/RFC6530,
              February 2012, <http://www.rfc-editor.org/info/rfc6530>.

   [RFC6531]  Yao, J. and W. Mao, "SMTP Extension for Internationalized
              Email", RFC 6531, DOI 10.17487/RFC6531, February 2012,
              <http://www.rfc-editor.org/info/rfc6531>.

   [RFC6532]  Yang, A., Steele, S., and N. Freed, "Internationalized
              Email Headers", RFC 6532, DOI 10.17487/RFC6532, February
              2012, <http://www.rfc-editor.org/info/rfc6532>.

   [PCRE838]  Hazel, P., Perl Compatible Regular Expressions, version
              8.38, November 23, 2015, <http://www.pcre.org/>.

   [POS-ERE]  IEEE Std 1003.1, 2013 Edition (incorporates IEEE Std
              1003.1-2008 and IEEE Std 1003.1-2008/Cor 1-2013, "Standard
              for Information Technology - Portable Operating System
              Interface (POSIX(R)) Base Specifications, Issue 7"
              (incorporating Technical Corrigendum 1), Section 9.4,
              "Extended Regular Expressions", April 2013,
              <http://pubs.opengroup.org/onlinepubs/9699919799/
              basedefs/V1_chap09.html>.

   [JSRE6ED]  Ecma International, "ECMAScript 2015 Language
              Specification", Standard ECMA-262, 6th Edition, June 2015,
              <http://www.ecma-international.org/ecma-262/6.0/>.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
              8.0.0", The Unicode Consortium, August 2015.
 


Leonard                  Best Current Practice                 [Page 22]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


9.2.  Informative References

   [RFC1818]  Postel, J., Li, T., and Y. Rekhter, "Best Current
              Practices", RFC 1818, DOI 10.17487/RFC1818, August 1995,
              <http://www.rfc-editor.org/info/rfc1818>.

   [RFC5598]  Crocker, D., "Internet Mail Architecture", RFC 5598, DOI
              10.17487/RFC5598, July 2009, <http://www.rfc-
              editor.org/info/rfc5598>.

   [MTECHDEV] Partridge, J., "The Technical Development of Internet
              Email", IEEE Annals of the History of Computing, Vol. 30,
              No. 2, DOI 10.1109/MAHC.2008.32, April-June 2008.



































 


Leonard                  Best Current Practice                 [Page 23]

Internet-Draft   Regular Expressions for Internet Mail    March 21, 2016


Appendix A. Test Vectors

   [[NB: Left in as a placeholder.]]

   This appendix will include a large set of test vectors to test
   matching and validation patterns.


Author's Address

   Sean Leonard
   Penango, Inc.
   5900 Wilshire Boulevard
   21st Floor
   Los Angeles, CA  90036
   USA

   EMail: dev+ietf@seantek.com
   URI:   http://www.penango.com/
































Leonard                  Best Current Practice                 [Page 24]