★ An Improved Liberal, Accurate Regex Pattern for Matching URLs

Posted on : 27-07-2010 | By : Benjamin | In : Code

Back in November, I posted a regex pattern for matching URLs. It seems to have proven quite useful for others, and, even better, based on feedback from those who’ve used it, I’ve since improved it in several ways.

The problem the pattern attempts to solve: identify the URLs in an arbitrary string of text, where by “arbitrary” let’s agree we mean something unstructured such as an email message or a tweet.

So, here’s a pattern that attempts to match any sort of URL, using the extended multiline regex format that disregards literal whitespace and allows for comments, which explain a bit about how the pattern works:

(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      /{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
                                    # (Trying not to match e.g. "URI::Escape")
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

Here’s the same pattern in the terse single-line format:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

(And you thought the multiline version looked crazy, right?)
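As a rough sketch of how the single-line pattern might be used, here it is in Python's built-in re module. Python is not one of the engines listed as tested, so treat this as an assumption rather than a guarantee; the names URL_PATTERN and find_urls and the sample text are made up for illustration.

```python
import re

# The single-line pattern from above, copied verbatim into a raw string.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_urls(text):
    """Return every substring of `text` that the pattern treats as a URL."""
    # Capture group 1 is the entire matched URL.
    return [match.group(1) for match in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sample text; the trailing comma and period are not part of the URLs.
    print(find_urls("See http://example.com/path, or www.example.org."))
```

In this sample the trailing comma and period are left out of the matches, which is the "trailing punctuation" behaviour the pattern aims for. The nested quantifiers can backtrack heavily on long runs of punctuation, so it may be wise to keep untrusted input short.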

Here’s the test data I used while sharpening the pattern. Just like the pattern from November, it attempts to be practical, above all else. It makes no attempt to parse URLs according to any official specification. It isn’t limited to predefined URL protocols. It should be clever about things like parentheses and trailing punctuation.

In addition to being liberal about the URLs it matches, the pattern is also liberal about which regex engines it works with. I’ve tested it with Perl, PCRE (which is used in PHP, BBEdit, and many other places), and Oniguruma (which is used in Ruby, TextMate, and many other places). It should also work in all modern JavaScript interpreters. If you find a modern regex engine where the pattern does not work, please let me know.

Some of the advantages of the new pattern, compared to the previous one:

  • It no longer uses the [:punct:] named character class. I thought this was universally supported in modern regex engines, but apparently it is not.
  • It does a better job with URLs containing literal parentheses, correctly matching the following URLs that the previous pattern did not:
    http://foo.com/more_(than)_one_(parens)
    
    http://foo.com/blah_(wikipedia)#cite-1
    
    http://foo.com/blah_(wikipedia)_blah#cite-1
    
    http://foo.com/unicode_(✪)_in_parens
    
    http://foo.com/(something)?after=parens
    
    
  • It now matches mailto: URLs.
  • It correctly guesses that things like “bit.ly/foo” and “is.gd/foo/” are URLs. Basically: something-dot-something-slash-something.
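The improvements listed above can be spot-checked with a small script. This is only a sketch in Python's re module (an engine not named in the post); the helper first_url, the surrounding sentence, and the mailto address are assumptions for illustration, while the http URLs and the two short-link examples come from the list.

```python
import re

# Single-line pattern from the post, copied verbatim.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

# URLs quoted in the list above, plus a made-up mailto address.
EXAMPLES = [
    "http://foo.com/more_(than)_one_(parens)",
    "http://foo.com/blah_(wikipedia)#cite-1",
    "http://foo.com/blah_(wikipedia)_blah#cite-1",
    "http://foo.com/unicode_(✪)_in_parens",
    "http://foo.com/(something)?after=parens",
    "mailto:name@example.com",  # hypothetical address
    "bit.ly/foo",
    "is.gd/foo/",
]


def first_url(text):
    """Return the first URL the pattern finds in `text`, or None."""
    match = URL_PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for url in EXAMPLES:
        # Embed each URL in a sentence so the pattern has to find its edges.
        found = first_url(f"Visit {url} now.")
        print("ok  " if found == url else "FAIL", url)
```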

Included in the parentheses-matching improvements is the ability to match up to two levels of balanced, nested parentheses: parentheses within parentheses. There are fancy ways of using dynamic or recursive regex patterns to match balanced parentheses of any arbitrary depth, but these dynamic/recursive pattern constructs are all specific to individual regex implementations. I.e., there’s one way to do it for PCRE, a different way for Perl, and in most regex engines, no way to do it at all. Hard-coding the pattern to support two levels of nested parentheses should work everywhere, and, practically speaking, I only received two reports of actual real-life URLs that had a second level of parentheses, and none with more than two.
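To see how the hard-coded approach generalises without any recursive regex features, the nested fragment can be generated for a fixed depth. This is a sketch, not part of the original pattern: the function name balanced_parens is made up, and it uses non-capturing groups where the original uses capturing ones.

```python
import re

# A run of characters that are neither whitespace nor ()<>, as in the pattern.
RUN = r"[^\s()<>]+"


def balanced_parens(depth):
    """Build a regex fragment for balanced parens nested up to `depth` levels.

    depth=2 gives the same shape as the hard-coded fragment in the pattern.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Innermost level: parentheses around a plain run, no further nesting.
    fragment = rf"\({RUN}\)"
    for _ in range(depth - 1):
        # Each outer level allows plain runs or the next level down.
        fragment = rf"\((?:{RUN}|{fragment})*\)"
    return fragment


if __name__ == "__main__":
    two_levels = balanced_parens(2)
    print(two_levels)
    print(bool(re.fullmatch(two_levels, "(a(b)c)")))    # two levels deep
    print(bool(re.fullmatch(two_levels, "(a(b(c)))")))  # three levels deep
```

The fragment grows with each extra level, so a fixed depth is still a trade-off: deeper nesting costs a longer pattern, while anything beyond the chosen depth simply will not match.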

Lastly, I received several requests for a version of the pattern that only matches web URLs — http, https, and things like “www.example.com”. Here’s an extended format pattern that does this:

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                       #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

And here’s the same pattern in single-line format:

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
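A short sketch of the web-only pattern in Python's re module (again, not an engine named in the post, so this is an assumption). The names WEB_URL_PATTERN and find_web_urls and the sample text are hypothetical.

```python
import re

# Web-only single-line pattern from above, copied verbatim.
WEB_URL_PATTERN = re.compile(
    r"""(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_web_urls(text):
    """Return the http, https, and www-style URLs found in `text`."""
    return [match.group(1) for match in WEB_URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sample: the mailto: address should be skipped.
    sample = (
        "Mail mailto:name@example.com or browse "
        "https://example.com/a and www.example.org today."
    )
    print(find_web_urls(sample))
```

One thing to watch: because of the "domain name followed by a slash" alternative, the host and path of a non-web URL such as an ftp:// link may still be picked up on their own, so a post-filter may be needed if that matters.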

As before, suggestions and improvements are welcome, including just sending me example input where the current pattern fails.

via ★ An Improved Liberal, Accurate Regex Pattern for Matching URLs.

I really like regex. I’ll save this here as a reference.

Make Users Happy (Not Managers)

Posted on : 19-07-2010 | By : Benjamin | In : IT, tech

Found this today

Groupware Bad
from here (Google, Pandas, and Lobsters)

Now the problem here is that the product’s direction changed utterly. Our focus in the client group had always been to build products and features that people wanted to use. That we wanted to use. That our moms wanted to use.

“Groupware” is all about things like “workflow”, which means, “the chairman of the committee has emailed me this checklist, and I’m done with item 3, so I want to check off item 3, so this document must be sent back to my supervisor to approve the fact that item 3 is changing from `unchecked’ to `checked’, and once he does that, it can be directed back to committee for review.”


If you want to do something that’s going to change the world, build software that people want to use instead of software that managers want to buy.

….make it trivially easy for someone to….

Good point, too often overlooked.

Who Makes Those Videos?

Posted on : 18-07-2010 | By : Benjamin | In : Humour, business

Sometimes bad can be good. YouTube came out with a new video editor and got the brilliant Yeshmin Blechin to explain it. I’ll go so far as to say that that may be one of the best video product demos I’ve ever seen. And that’s pretty far. One thing that the video demonstrates clearly is that YouTube is not Microsoft.

Microsoft is better known for atrociously bad commercials and demo videos—and PowerPoint presentations and sales meeting pep talks by Steve Ballmer channeling that gorilla in the suitcase commercial—than for performance art catastrophes, but it’s pretty hard to top Bill Gates releasing a horde of live, hungry mosquitos on the audience at a TED conference. Angelina Jolie was among the victims. Yep, she of the bee-stung lips left TED with a mosquito-bit something. I can’t be more specific because the particular part of her anatomy that got bit was left out of the story. What passes for journalism today.

via The Pragmatic Bookshelf.

Introduced me to some great publicity I missed. Very entertaining to watch the videos.

And don’t miss the successful Old Spice videos, if you haven’t seen them already.

Whatever Happened to Voice Recognition?

Posted on : 05-07-2010 | By : Benjamin | In : IT, Musings, science

In 2004, Mike Bliss composed a poem about voice recognition. He then read it to voice recognition software on his PC, and rewrote it as recognized.

a poem by Mike Bliss

like a baby, it listens
it can’t discriminate
it tries to understand
it reflects what it thinks you say
it gets it wrong… sometimes
sometimes it gets it right.
One day it will grow up,
like a baby, it has potential
will it go to work?
will it turn to crime?
you look at it indulgently.
you can’t help loving it, can you?

a poem by like myth

like a baby, it nuisance
it can’t discriminate
it tries to oven
it reflects lot it things you say
it gets it run sometimes
sometimes it gets it right
won’t day it will grow bop
Ninth a baby, it has provincial
will it both to look?
will it the two crime?
you move at it inevitably
you can’t help loving it, cannot you?

There’s only one teeny-tiny problem with this magical future world of computers we control with our voices.

[Figure: voice recognition accuracy rate over time]

It doesn’t work.

Despite ridiculous, order of magnitude increases in computing power over the last decade, we can’t figure out how to get speech recognition accuracy above 80% — when the baseline human voice transcription accuracy rate is anywhere from 96% to 98%!

via Whatever Happened to Voice Recognition?.

But the software is so cheap… (I think we’d be better off teaching more people how to type on a keyboard and use Google without http://lmgtfy.com?q=lmgtfy )

Microsoft Kin Discontinued After 48 Days

Posted on : 01-07-2010 | By : Benjamin | In : business, tech

Just 48 days after Microsoft began selling the Kin, a smartphone for the younger set, the company discontinued it because of disappointing sales.

via Microsoft Kin Discontinued After 48 Days – NYTimes.com.

All I can say is ‘Wow, that is really embarrassing’.

Found on Pandora

Posted on : 30-06-2010 | By : Benjamin | In : Music

Be Honest
by a dream too late
From the Album Intermission To The Moon

Really like it.  (It’s on my Pandora Station, Among The Oak & Ash Radio)

Privacy Theatre, Google, and Users of this website accept its TOS

Posted on : 30-06-2010 | By : Benjamin | In : business, tech

Ben Adida writes:

Privacy Advocacy Theater

May 27, 2010 @ 1:58 pm

Ed Felten recently used the very nice term Privacy Theater in describing the insanity of 6,000-word privacy agreements that we pretend to understand. The term, inspired by Bruce Schneier’s “security theater” description of US airport security, may have been introduced by Rohit Khare in December 2009 on TechCrunch, where he described how “social networks only pretend to protect your privacy.” These are real issues, and I wholeheartedly agree that long privacy policies and generally consumer-directed fine-print are all theater.

I like this idea.  He then discusses what he calls advocacy theatre:

I want to focus on a related problem that I’ll call privacy advocacy theater. This is a problem that my friends and colleagues are guilty of, and I’m sure I’m guilty of it at times, too. Privacy Advocacy Theater is the act of extreme criticism for an accidental data breach rather than a systemic privacy design flaw. Example: if you’re up in arms over the Google Street View privacy “fiasco” of the last few days, you’re guilty of Privacy Advocacy Theater. (If you’re generally worried about Google Street View, that’s a different problem, there are real concerns there, but I’m only talking about the collection of wifi network payload data Google performed by mistake.)

On a technical level, Ben follows up:

devices, payload data, and why Kim is (in part) right.

June 1, 2010 @ 8:19 pm

A few days ago, I wrote about privacy advocacy theater and lamented how some folks, including EPIC and Kim Cameron, are attacking Google in a needlessly harsh way for what was an accidental collection of data. Kim Cameron responded, and he is right to point out that my argument, in the Google case, missed an important issue.

Kim points out that two issues got confused in the flurry of press activity: the accidental collection of payload data, i.e. the URLs and web content you browsed on unsecured wifi at the moment the Google Street View car was driving by, and the intentional collection of device identifiers, i.e. the network hardware identifiers and network names of public wifi access points. Kim thinks the network identifiers are inherently more problematic than the payload, because they last for quite a bit of time, while payload data, collected for a few randomly chosen milliseconds, are quite ephemeral and unlikely to be problematic.

Kim’s right on both points. Discussion of device identifiers, which I missed in my first post, is necessary, because the data collection, in this case, was intentional, and apparently was not disclosed, as documented in EPIC’s letter to the FCC. If Google is collecting public wifi data, they should at least disclose it. In their blog post on this topic, Google does not clarify that issue.

I enjoyed the way of thinking here in addition to the issues discussed.

Could your Dell fail? (It might have already)

Posted on : 30-06-2010 | By : Benjamin | In : business, tech

I sit here on a Dell Optiplex 755 and wonder.

According to company memorandums and other documents recently unsealed in a civil case against Dell in Federal District Court in North Carolina, Dell appears to have suffered from the bad capacitors, made by a company called Nichicon, far more than its rivals. Internal documents show that Dell shipped at least 11.8 million computers from May 2003 to July 2005 that were at risk of failing because of the faulty components. These were Dell’s OptiPlex desktop computers — the company’s mainstream products sold to business and government customers.

A study by Dell found that OptiPlex computers affected by the bad capacitors were expected to cause problems up to 97 percent of the time over a three-year period, according to the lawsuit.

As complaints mounted, Dell hired a contractor to investigate the situation. According to a Dell filing in the lawsuit, which has not yet gone to trial, the contractor found that 10 times more computers were at risk of failing than Dell had estimated. Making problems worse, Dell replaced faulty motherboards with other faulty motherboards, according to the contractor’s findings.

Carey Holzman, a computer expert who investigated the capacitor problems and collected photos from people with broken motherboards, had a different take on the safety situation.

“Of course it’s dangerous,” Mr. Holzman said. “Having leaking capacitors is a huge problem.” He found that the capacitor problems could cause computers to catch fire [emphasis added -BF].

Financial Disaster Expert Needed

Posted on : 30-06-2010 | By : Benjamin | In : science

Can this octopus help?

Which Team Are You On ‘at the end of the day’?

Posted on : 30-06-2010 | By : Benjamin | In : Uncategorized

Apparently Senator Amy Klobuchar asked Supreme Court nominee Elena Kagan which Twilight team she is on:

The senator jokingly asked Kagan’s thoughts on ”the vampire versus the werewolf.”

Klobuchar, a Minnesota Democrat, said she realized Kagan ”can’t comment on future cases. So I’ll leave that alone.”

Of course she couldn’t answer that. (Team Edward)

P.S. And this is posted in the NYTimes under Politics?
