DIYbanter - regex guru required

Dave Liquorice[_2_]

November 9th 17 05:55 PM

regex guru required

Given the string:

<td class="l"><span title="Word_1">Word_1</span></td><td
class="l"><span title=""></span></td><td class="l">Word_3</td><td
class="l" style="color: green;">Word_4</td>

What regex magicary for PHP's preg_match_all that can extract just
the text of the "Word_n" fields *including* the empty Word_2. That is
I want a list or four variables filled with Word_1, Word_2, Word_3
and Word_4 even when a field is empty. The "words" change and so does
the color. The actual string is longer but all subsequent fields
follow the same format as Word_3.

Just dumping everything between < and > or collecting everything
between > and < doesn't work as there are effectively empty matches
between adjacent tags. So you end up with

$1 = ""
$2 = "Word_1"
$3 = ""
$4 = ""
$5 = ""
$6 = "" (This would be "Word_2" if it wasn't empty)
$7 = ""
$8 = "Word_3"
$9 = ""
$10 = "Word_4"

Rather than:

$1 = "Word_1"
$2 = "" (This would be "Word_2" if it wasn't empty)
$3 = "Word_3"
$4 = "Word_4"

Reliably finding the end of each word is easy with: (.*?)<\/[s|t]

Finding the beginning is what I'm stuck on

\">(.*?)<\/[s|t] fails as it leaves the span title tag in place.

\">([^<].*?)<\/[s|t] fails as it strips the empty Word_2

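The two failures described above can be reproduced with a short script. This is only a sketch: it assumes the sample row and both attempts originally contained the angle brackets that are missing from the archived text, and the variable names are made up for illustration.

```php
<?php
// The sample row, assuming its stripped angle brackets are restored.
$html = '<td class="l"><span title="Word_1">Word_1</span></td><td' . "\n"
      . 'class="l"><span title=""></span></td><td class="l">Word_3</td><td' . "\n"
      . 'class="l" style="color: green;">Word_4</td>';

// First attempt: capturing starts at the first "> it meets, which belongs
// to the <td>, so the opening <span ...> tag ends up inside the capture.
preg_match_all('/\">(.*?)<\/[s|t]/', $html, $m);
$leavesSpan = $m[1];

// Second attempt: [^<] demands at least one non-"<" character after ">,
// so a cell with nothing in it can never match and is silently skipped.
preg_match_all('/\">([^<].*?)<\/[s|t]/', $html, $m);
$dropsEmpty = $m[1];

print_r($leavesSpan); // first entry still carries the <span title=...> tag
print_r($dropsEmpty); // only three entries: the empty field is missing
```
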
--
Cheers
Dave.

Andy Burns[_13_]

November 9th 17 06:48 PM

regex guru required

Dave Liquorice wrote:

> What regex magicary for PHP's preg_match_all that can extract just
> the text of the "Word_n" fields *including* the empty Word_2.

No immediate answer, but I find using the online "regex tinkering tools"
helps, e.g.

https://regex101.com/r/CGUbln/1

Andy Burns[_13_]

November 9th 17 07:15 PM

regex guru required

Andy Burns wrote:

> Dave Liquorice wrote:
>
>> What regex magicary for PHP's preg_match_all that can extract just
>> the text of the "Word_n" fields *including* the empty Word_2.
>
> I find using the online "regex tinkering tools" helps

Try this ...

https://regex101.com/r/CGUbln/2

Andy Burns[_13_]

November 9th 17 07:21 PM

regex guru required

Andy Burns wrote:

> https://regex101.com/r/CGUbln/2

yes, that seems to work in the general case, so the regex you want is

<td[^>]*>(?:<span[^>]*>)?([^<]*)(?:<\/span>)?<\/td>
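
For anyone wanting to try it from PHP, a minimal sketch follows. It assumes the pattern reads as above once the angle brackets lost from the archived text are restored, and it swaps the delimiter to "~" so the slashes in the closing tags need no escaping; $html, $pattern and $words are illustrative names only.

```php
<?php
// Sample row from the original post, with its angle brackets restored.
$html = '<td class="l"><span title="Word_1">Word_1</span></td><td' . "\n"
      . 'class="l"><span title=""></span></td><td class="l">Word_3</td><td' . "\n"
      . 'class="l" style="color: green;">Word_4</td>';

// Same pattern as in the post, written with "~" as the delimiter.
$pattern = '~<td[^>]*>(?:<span[^>]*>)?([^<]*)(?:</span>)?</td>~';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, $matches[1] the one capture group.
$words = $matches[1];

print_r($words); // four entries: Word_1, an empty string, Word_3, Word_4
```

Because the span tags sit in non-capturing groups, the text is always in group 1 whether or not a span is present, and an empty cell still produces an entry.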

The Natural Philosopher[_2_]

November 9th 17 07:30 PM

regex guru required

On 09/11/17 17:55, Dave Liquorice wrote:

> What regex magicary for PHP's preg_match_all that can extract just
> the text of the "Word_n" fields *including* the empty Word_2. That is
> I want a list or four variables filled with Word_1, Word_2, Word_3
> and Word_4 even when a field is empty.

Do it in stages.

Find what is between <td></td> first.

Then eliminate anything between < and >

What's left, if anything, will be the wanted words
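
One way the staged approach might look in PHP is sketched below. It is an illustration rather than anything posted in the thread: the two patterns and the variable names are assumptions, and the sample row is used with its angle brackets restored.

```php
<?php
// Sample row from the original post, with its angle brackets restored.
$html = '<td class="l"><span title="Word_1">Word_1</span></td><td' . "\n"
      . 'class="l"><span title=""></span></td><td class="l">Word_3</td><td' . "\n"
      . 'class="l" style="color: green;">Word_4</td>';

// Stage 1: collect whatever sits between each <td ...> and its </td>.
preg_match_all('~<td[^>]*>(.*?)</td>~s', $html, $cells);

// Stage 2: within each cell, delete anything between < and >.
$words = array_map(
    function ($cell) {
        return preg_replace('~<[^>]*>~', '', $cell);
    },
    $cells[1]
);

print_r($words); // one entry per cell; the empty cell becomes ''
```

In this sketch the empty field keeps its position, because stage 1 yields one entry per cell before any tags are removed.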

--
Canada is all right really, though not for the whole weekend.

"Saki"

Andy Burns[_13_]

November 9th 17 07:49 PM

regex guru required

Andy Burns wrote:

> the regex you want is

If you want it more general, so it will capture the inner text from
within either a single, or double nested set of html elements,
regardless of what the element types are ...

https://regex101.com/r/CGUbln/3

Bob Eager[_5_]

November 9th 17 08:59 PM

regex guru required

On Thu, 09 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:

> do it in stages.
>
> Find what is between <td></td> first.
>
> Then eliminate anything between < and >
>
> What's left, if anything, will be the wanted words

I once had to do a similar task, and regex really isn't the right answer.
Better to use a program that just strips out what isn't wanted.

(there was a guy, maybe this group, can't remember, who had about 400 web
pages on WWI bombing missions and he just wanted to extract names of
crew; the pages had been written by different people and weren't
consistent)

--
My posts are my copyright and if @diy_forums or Home Owners' Hub
wish to copy them they can pay me £1 a message.
Use the BIG mirror service in the UK: http://www.mirrorservice.org
*lightning surge protection* - a w_tom conductor

Andy Burns[_13_]

November 9th 17 09:06 PM

regex guru required

Bob Eager wrote:

> I once had to do a similar task, and regex really isn't the right answer.

If I had a choice, I'd use something that could read the HTML document and
access it through the DOM, perhaps with XPath; and PHP is almost never my
weapon of choice ...
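
For what it's worth, the DOM and XPath route is available in PHP too. The sketch below assumes PHP's bundled DOM extension is enabled, wraps the fragment in a hypothetical table and row so the parser sees the cells in context, and uses illustrative variable names.

```php
<?php
// Sample row from the original post, with its angle brackets restored.
$html = '<td class="l"><span title="Word_1">Word_1</span></td><td' . "\n"
      . 'class="l"><span title=""></span></td><td class="l">Word_3</td><td' . "\n"
      . 'class="l" style="color: green;">Word_4</td>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML('<table><tr>' . $html . '</tr></table>');
libxml_clear_errors();

// Select every table cell and read its text, ignoring any nested tags.
$xpath = new DOMXPath($doc);
$words = [];
foreach ($xpath->query('//td') as $td) {
    $words[] = $td->textContent;
}

print_r($words);
```

The parser copes with any depth of nesting inside a cell, which is the case a fixed regex is most likely to trip over.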

Bob Eager[_5_]

November 9th 17 09:30 PM

regex guru required

On Thu, 09 Nov 2017 21:06:43 +0000, Andy Burns wrote:

> Bob Eager wrote:
>
>> I once had to do a similar task, and regex really isn't the right
>> answer.
>
> If I had a choice something that could read the html document and access
> it through the DOM model, perhaps with XPath, and PHP is almost never my
> weapon of choice ...

I used this:

http://www.ml1.org.uk

A program over 50 years old....!

--
My posts are my copyright and if @diy_forums or Home Owners' Hub
wish to copy them they can pay me £1 a message.
Use the BIG mirror service in the UK: http://www.mirrorservice.org
*lightning surge protection* - a w_tom conductor

Dave Liquorice[_2_]

November 9th 17 11:37 PM

regex guru required

On Thu, 9 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:

> Find what is between <td></td> first.
>
> Then eliminate anything between < and >
>
> What's left, if anything, will be the wanted words

Trouble is if there is nothing left (empty field) it doesn't return
anything for that position.

--
Cheers
Dave.

The Natural Philosopher[_2_]

November 10th 17 06:15 AM

regex guru required

On 09/11/17 23:37, Dave Liquorice wrote:

> On Thu, 9 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:
>
>> Find what is between <td></td> first.
>>
>> Then eliminate anything between < and >
>>
>> What's left, if anything, will be the wanted words
>
> Trouble is if there is nothing left (empty field) it doesn't return
> anything for that position.

PHP will return a null string.

--
If you tell a lie big enough and keep repeating it, people will
eventually come to believe it. The lie can be maintained only for such
time as the State can shield the people from the political, economic
and/or military consequences of the lie. It thus becomes vitally
important for the State to use all of its powers to repress dissent, for
the truth is the mortal enemy of the lie, and thus by extension, the
truth is the greatest enemy of the State.

Joseph Goebbels

Adrian Caspersz

November 10th 17 02:45 PM

regex guru required

On 09/11/17 21:06, Andy Burns wrote:

> Bob Eager wrote:
>
>> I once had to do a similar task, and regex really isn't the right answer.
>
> If I had a choice something that could read the html document and access
> it through the DOM model, perhaps with XPath, and PHP is almost never my
> weapon of choice ...

And for that, I've used CSS selectors with Pup.

https://github.com/EricChiang/pup

Which works a bit like jq on json.

--
Adrian C

Mark[_24_]

November 10th 17 03:40 PM

regex guru required

On Thu, 09 Nov 2017 17:55:20 +0000 (GMT), "Dave Liquorice"
wrote:

> What regex magicary for PHP's preg_match_all that can extract just
> the text of the "Word_n" fields *including* the empty Word_2.

I don't know PHP, but I'd use an XML parser for this. Regexes seem
inappropriate for this kind of task.

--
If a man stands in a forest and no woman is around to hear him, is he still wrong?

Dave Liquorice[_2_]

November 11th 17 04:12 PM

regex guru required

On Fri, 10 Nov 2017 06:15:41 +0000, The Natural Philosopher wrote:

>>> Find what is between <td></td> first.
>>>
>>> Then eliminate anything between < and >
>>>
>>> What's left, if anything, will be the wanted words
>>
>> Trouble is if there is nothing left (empty field) it doesn't
>> return anything for that position.
>
> PHP will return a null string.

It didn't the ways I tried.

And Andy yes, I did see your posts and solution that does work, thank
you. Now need to a) work out what it's doing, I think it's the
(?:...)? construct. b) investigate the site you used.

As for PHP not being the best tool, maybe, but it's part of an
existing PHP page that works apart from that one little niggle.

--
Cheers
Dave.

Andy Burns[_13_]

November 11th 17 04:31 PM

regex guru required

Dave Liquorice wrote:

> Andy yes, I did see your posts and solution that does work, thank
> you. Now need to a) work out what it's doing,

that's the thing with regex ... don't ask me in 6 months what it's
doing, especially the 3rd version!

Rather than using .* to match the remainder of html tags I used [^>]*
which is less greedy, to match everything up to, but not including the
closing chevron to make sure it matches just a single tag at a time.

> I think it's the (?:...)? construct.

That's a non-capturing group since you're not really interested in the
second level of html tags wrapping the inner text, other than to notice
and skip them, that way you don't need to worry about the "Nth" match
varying depending if the span tags exist or not,

so it matches ...

<opening td><optional span>CAPTURED-TEXT<optional close span><close td>

of course it could get confused if the site you're scraping from
suddenly uses a third level of tags inside the span for some rows.
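
The same breakdown can be written into the pattern itself using PCRE's extended (x) mode, which ignores whitespace and allows # comments. This is a sketch, assuming the pattern is the one given earlier with its angle brackets restored; the $nested cell is a hypothetical example of the third-level case.

```php
<?php
$pattern = '~
    <td[^>]*>           # opening td tag, with any attributes
    (?:<span[^>]*>)?    # optional opening span; (?: ) groups without capturing
    ([^<]*)             # group 1: the text, up to the next "<" (may be empty)
    (?:</span>)?        # optional closing span
    </td>               # closing td tag
~x';

// A plain cell and a span-wrapped cell both put their text in group 1.
$row = '<td class="l"><span title="Word_1">Word_1</span></td><td class="l">Word_3</td>';
preg_match_all($pattern, $row, $m);
$words = $m[1];

// A hypothetical cell with a third level of tags does not match at all.
$nested = '<td class="l"><span title="x"><b>Word_5</b></span></td>';
$nestedCount = preg_match_all($pattern, $nested, $unused);

print_r($words);
var_dump($nestedCount);
```
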

> b) investigate the site you used.

I've found it handy several times.

> As for PHP not being the best tool, maybe, but it's part of an
> existing PHP page that works apart from that one little niggle.

I guessed as much ...

Andy Burns[_13_]

November 11th 17 04:36 PM

regex guru required

Adrian Caspersz wrote:

> https://github.com/EricChiang/pup
> Which works a bit like jq on json.

I have been trying to avoid picking sides between Go and Rust ...

All times are GMT +1.