★ An Improved Liberal, Accurate Regex Pattern for Matching URLs

Posted on : 27-07-2010 | By : Benjamin | In : Code

Back in November, I posted a regex pattern for matching URLs. It seems to have proven quite useful for others, and, even better, based on feedback from those who’ve used it, I’ve since improved it in several ways.

The problem the pattern attempts to solve: identify the URLs in an arbitrary string of text, where by “arbitrary” let’s agree we mean something unstructured such as an email message or a tweet.

So, here’s a pattern that attempts to match any sort of URL, using the extended multiline regex format that disregards literal whitespace and allows for comments, which explain a bit about how the pattern works:

(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      /{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
                                    # (Trying not to match e.g. "URI::Escape")
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

Here’s the same pattern in the terse single-line format:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

(And you thought the multiline version looked crazy, right?)
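
As a rough illustration of how the single-line pattern might be applied, here is a minimal Python sketch. Python's `re` module is not one of the engines this post reports testing, so treat its use here as an assumption. The helper name `find_urls` and the sample text are made up for the example.

```python
import re

# The single-line pattern from above, copied verbatim. A triple-quoted raw
# string is used because the pattern contains both ' and " characters.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_urls(text):
    """Return every URL-like substring (capture group 1) found in text."""
    # finditer is used instead of findall because the pattern has several
    # capture groups and only group 1 holds the whole URL.
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


# Hypothetical sample text with trailing punctuation and wrapping parentheses.
sample = (
    'See http://foo.com/blah_blah. (Also http://foo.com/blah_blah_(wikipedia)) '
    'and "bit.ly/foo", but not URI::Escape.'
)

for url in find_urls(sample):
    print(url)
```

Only capture group 1 is read; the inner groups exist for the parenthesis handling and can be ignored by callers.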

Here’s the test data I used while sharpening the pattern. Just like the pattern from November, it attempts to be practical, above all else. It makes no attempt to parse URLs according to any official specification. It isn’t limited to predefined URL protocols. It should be clever about things like parentheses and trailing punctuation.

In addition to being liberal about the URLs it matches, the pattern is also liberal about which regex engines it works with. I’ve tested it with Perl, PCRE (which is used in PHP, BBEdit, and many other places), and Oniguruma (which is used in Ruby, TextMate, and many other places). It should also work in all modern JavaScript interpreters. If you find a modern regex engine where the pattern does not work, please let me know.

Some of the advantages of the new pattern, compared to the previous one:

  • It no longer uses the [:punct:] named character class. I thought this was universally supported in modern regex engines, but apparently it is not.
  • It does a better job with URLs containing literal parentheses, correctly matching the following URLs that the previous pattern did not:
    http://foo.com/more_(than)_one_(parens)
    http://foo.com/blah_(wikipedia)#cite-1
    http://foo.com/blah_(wikipedia)_blah#cite-1
    http://foo.com/unicode_(✪)_in_parens
    http://foo.com/(something)?after=parens
  • It now matches mailto: URLs.
  • It correctly guesses that things like “bit.ly/foo” and “is.gd/foo/” are URLs. Basically: something-dot-something-slash-something.
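
The improvements listed above can be spot-checked with a short script. This is a sketch in Python (an assumption, since the post does not list Python among the tested engines). The helper `matches_whole` is a hypothetical name, and the `mailto:` address is invented for illustration, as the post gives no example of one.

```python
import re

# Single-line pattern from this post, copied verbatim.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def matches_whole(candidate):
    """True if the pattern matches at position 0 and covers all of candidate."""
    m = URL_PATTERN.match(candidate)
    return m is not None and m.group(1) == candidate


examples = [
    # URLs with literal parentheses, taken from the list above.
    "http://foo.com/more_(than)_one_(parens)",
    "http://foo.com/blah_(wikipedia)#cite-1",
    "http://foo.com/blah_(wikipedia)_blah#cite-1",
    "http://foo.com/unicode_(✪)_in_parens",
    "http://foo.com/(something)?after=parens",
    # Hypothetical mailto: address, not from the post.
    "mailto:someone@example.com",
    # something-dot-something-slash-something
    "bit.ly/foo",
    "is.gd/foo/",
]

for candidate in examples:
    print(candidate, matches_whole(candidate))
```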

Included in the parentheses-matching improvements is the ability to match up to two levels of balanced, nested parentheses (parentheses within parentheses). There are fancy ways of using dynamic or recursive regex patterns to match balanced parentheses of any arbitrary depth, but these dynamic/recursive pattern constructs are all specific to individual regex implementations. That is, there’s one way to do it for PCRE, a different way for Perl, and in most regex engines, no way to do it at all. Hard-coding the pattern to support two levels of nested parentheses should work everywhere, and, practically speaking, I only received two reports of actual real-life URLs that had a second level of parentheses, and none with more than two.
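
To see the two-level limit in isolation, the parenthesis sub-pattern can be pulled out and tried on its own. The sketch below uses Python (an assumption, not an engine the post mentions). `PARENS` and `is_balanced_group` are hypothetical names, and the test strings are invented.

```python
import re

# The balanced-parens fragment used twice in the full pattern: an outer pair
# of parentheses that may contain runs of ordinary characters and, at most,
# one further level of parentheses.
PARENS = re.compile(r"\(([^\s()<>]+|(\([^\s()<>]+\)))*\)")


def is_balanced_group(s):
    """True if s is, in its entirety, a parenthesized group the fragment accepts."""
    return PARENS.fullmatch(s) is not None


for s in ["(wikipedia)", "(a(b)c)", "((a))", "()", "(a(b(c)))", "(a b)"]:
    print(repr(s), is_balanced_group(s))
```

The first four strings are accepted. `(a(b(c)))` is rejected because it needs a third level of nesting, and `(a b)` is rejected because whitespace is never allowed inside a URL match.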

Lastly, I received several requests for a version of the pattern that only matches web URLs — http, https, and things like “www.example.com”. Here’s an extended format pattern that does this:

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                       #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

And here’s the same pattern in single-line format:

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
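
Here is a small sketch of the web-only pattern in use, again in Python (an assumption, since the post does not list it among the tested engines). `find_web_urls` is a hypothetical helper name, and the sample text and addresses are invented.

```python
import re

# Web-only single-line pattern from above, copied verbatim.
WEB_URL_PATTERN = re.compile(
    r"""(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_web_urls(text):
    """Return every web-URL-like substring (capture group 1) found in text."""
    return [m.group(1) for m in WEB_URL_PATTERN.finditer(text)]


# Hypothetical sample: two web URLs and one mailto: address.
sample = (
    "Docs at https://example.com/docs and www.example.com/faq; "
    "mail mailto:someone@example.com instead."
)

for url in find_web_urls(sample):
    print(url)
```

The `mailto:` address in the sample is skipped. One caveat worth knowing: the "domain name followed by a slash" alternative has no notion of protocol, so given something like `ftp://files.example.com/pub` it can still pick up the `files.example.com/pub` portion.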

As before, suggestions and improvements are welcome, including just sending me example input where the current pattern fails.

via ★ An Improved Liberal, Accurate Regex Pattern for Matching URLs.

I really like regex. I’ll save this here as a reference.

Limitations of Current HTML5 Video Tag Implementation

Posted on : 21-12-2009 | By : Benjamin | In : Uncategorized

Why the HTML5 ‘Video’ Element Is Effectively Unusable, Even in the Browsers Which Support It

I seldom post video to DF, but when I do, I refuse to embed Flash, I want the markup to be sane and standard, I want the video to play in popular standards-compliant web browsers, and I don’t want the video to download/buffer automatically. Here’s an example from a year ago, using QuickTime.
….
That markup met all of my aforementioned desires but for one: the <embed> tag is not standard. Worse, it now has a new significant problem: it doesn’t work at all in Chrome (at least on the Mac).

In all three browsers (Safari, Chrome, Firefox), with the above simple markup, the video content buffers automatically on page load. What I mean is that as soon as you load the web page, the browsers download the actual video files that are embedded. As stated at the outset, I don’t want that.
….
But this browser behavior is very much undesirable for both publishers and users in common contexts. Users loading the page over a slow connection, or a pay-by-the-megabyte metered connection (which is common with wireless networks), should not be forced to download a potentially large video every time they load the page. Likewise, publishers should not be forced to pay for the bandwidth to transmit videos that won’t be watched.

He also points out that the problem lies, in part, in the HTML5 specification.

Updated: Firefox respects autobuffer
