UK diy (uk.d-i-y) For the discussion of all topics related to diy (do-it-yourself) in the UK. All levels of experience and proficiency are welcome to join in to ask questions or offer solutions.

#1
Posted to uk.d-i-y
regex guru required

Given the string:

<td class="l"><span title="Word_1">Word_1</span></td><td
class="l"><span title=""></span></td><td class="l">Word_3</td><td
class="l" style="color: green;">Word_4</td>

What regex magicary for PHP's preg_match_all that can extract just
the text of the "Word_n" fields *including* the empty Word_2. That is
I want a list or four variables filled with Word_1, Word_2, Word_3
and Word_4 even when a field is empty. The "words" change and so does
the color. The actual string is longer but all subsequent fields
follow the same format as Word_3.

Just dumping everything between < and > or collecting everything
between > and < doesn't work as there are effectively empty matches
between adjacent tags. So you end up with:

$1 = ""
$2 = "Word_1"
$3 = ""
$4 = ""
$5 = ""
$6 = "" (This would be "Word_2" if it wasn't empty)
$7 = ""
$8 = "Word_3"
$9 = ""
$10 = "Word_4"

Rather than:

$1 = "Word_1"
$2 = "" (This would be "Word_2" if it wasn't empty)
$3 = "Word_3"
$4 = "Word_4"

Reliably finding the end of each word is easy with: (.*?)<\/[s|t]

Finding the beginning is what I'm stuck on.

\">(.*?)<\/[s|t] fails as it leaves the span title tag in place.

\">([^<].*?)<\/[s|t] fails as it strips the empty Word_2.
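
For anyone wanting to reproduce the two failures, here is a minimal sketch. It assumes the sample string with its angle brackets in place, and it uses ~ as the pattern delimiter rather than the escaped form above; the variable names are made up for illustration.

```php
<?php
// Sample row from the post, joined onto one logical line.
$html = '<td class="l"><span title="Word_1">Word_1</span></td>'
      . '<td class="l"><span title=""></span></td>'
      . '<td class="l">Word_3</td>'
      . '<td class="l" style="color: green;">Word_4</td>';

// Attempt 1: start at "> and stop lazily at the next </s or </t.
// The first "> belongs to the td, so the span tag ends up in the capture.
preg_match_all('~">(.*?)</[s|t]~', $html, $a);

// Attempt 2: insist that the first captured character is not "<".
// An empty field has no such character, so that cell never matches.
preg_match_all('~">([^<].*?)</[s|t]~', $html, $b);

print_r($a[1]); // four entries, two of them still carrying a span tag
print_r($b[1]); // three entries, the empty field is missing
```

The second pattern shifts every later field up by one position whenever a cell is empty, which is the behaviour being complained about.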

--
Cheers
Dave.



#2

Dave Liquorice wrote:

> Given the string:
>
> <td class="l"><span title="Word_1">Word_1</span></td><td
> class="l"><span title=""></span></td><td class="l">Word_3</td><td
> class="l" style="color: green;">Word_4</td>
>
> What regex magicary for PHP's preg_match_all that can extract just
> the text of the "Word_n" fields *including* the empty Word_2.



No immediate answer, but I find using the online "regex tinkering tools"
helps, e.g.

https://regex101.com/r/CGUbln/1


#3

Andy Burns wrote:

> Dave Liquorice wrote:
>
>> What regex magicary for PHP's preg_match_all that can extract just
>> the text of the "Word_n" fields *including* the empty Word_2.
>
> I find using the online "regex tinkering tools" helps


Try this ...

https://regex101.com/r/CGUbln/2

#4

Andy Burns wrote:

> https://regex101.com/r/CGUbln/2


yes, that seems to work in the general case, so the regex you want is

<td[^>]*>(?:<span[^>]*>)?([^<]*)(?:<\/span>)?<\/td>
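
A minimal sketch of how that pattern might be dropped into preg_match_all. The sample string is the one from the first post with its angle brackets in place; the ~ delimiter and the variable names are assumptions for illustration, not anything taken from the existing page.

```php
<?php
// Sample row from the first post, joined onto one logical line.
$html = '<td class="l"><span title="Word_1">Word_1</span></td>'
      . '<td class="l"><span title=""></span></td>'
      . '<td class="l">Word_3</td>'
      . '<td class="l" style="color: green;">Word_4</td>';

// '~' as delimiter means the '/' in closing tags needs no backslash.
$pattern = '~<td[^>]*>(?:<span[^>]*>)?([^<]*)(?:</span>)?</td>~';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, $matches[1] holds capture group 1,
// one entry per cell, with an empty string where the cell had no text.
$words = $matches[1];
print_r($words);
```

Because there is exactly one capturing group, the position of each word in $words lines up with the position of its cell, whether or not a span was present.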

#5

Do it in stages.

Find what is between <td> and </td> first.

Then eliminate anything between < and >.

What's left, if anything, will be the wanted words.
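
The staged approach can be sketched roughly as follows. This is one possible reading of the steps, not necessarily how they were meant to be coded; the patterns and variable names are assumptions, and it presumes no attribute value contains a > character.

```php
<?php
// Sample row from the first post, joined onto one logical line.
$html = '<td class="l"><span title="Word_1">Word_1</span></td>'
      . '<td class="l"><span title=""></span></td>'
      . '<td class="l">Word_3</td>'
      . '<td class="l" style="color: green;">Word_4</td>';

// Stage 1: capture whatever sits between each <td ...> and its </td>.
// The 's' modifier lets '.' cross line breaks if the real input has any.
preg_match_all('~<td[^>]*>(.*?)</td>~s', $html, $cells);

// Stage 2: delete every remaining tag, i.e. everything from '<' to '>'.
// Given an array subject, preg_replace returns an array with the same keys.
$words = preg_replace('~<[^>]*>~', '', $cells[1]);

print_r($words);
```

In this form stage 1 always yields one entry per cell, so a cell that is empty after stage 2 stays in the list as an empty string rather than disappearing.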



--
Canada is all right really, though not for the whole weekend.

"Saki"


#6

Andy Burns wrote:

> the regex you want is


If you want it more general, so it will capture the inner text from
within either a single, or double nested set of html elements,
regardless of what the element types are ...

https://regex101.com/r/CGUbln/3


#7

On Thu, 09 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:

> do it in stages.
>
> Find what is between <td> and </td> first.
>
> Then eliminate anything between < and >
>
> What's left, if anything, will be the wanted words


I once had to do a similar task, and regex really isn't the right answer.
Better to use a program that just strips out what isn't wanted.

(there was a guy, maybe this group, can't remember, who had about 400 web
pages on WWI bombing missions and he just wanted to extract names of
crew; the pages had been written by different people and weren't
consistent)




--
My posts are my copyright and if @diy_forums or Home Owners' Hub
wish to copy them they can pay me £1 a message.
Use the BIG mirror service in the UK: http://www.mirrorservice.org
*lightning surge protection* - a w_tom conductor

#8

Bob Eager wrote:

> I once had to do a similar task, and regex really isn't the right answer.


If I had a choice I'd use something that could read the HTML document and
access it through the DOM, perhaps with XPath, and PHP is almost never my
weapon of choice ...
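
For what it's worth, PHP does ship a DOM extension, so the DOM and XPath route could look something like the sketch below. The surrounding table and tr wrapper, the class="l" filter and the variable names are assumptions made for the example; the real page may need a different XPath expression.

```php
<?php
// Sample cells from the first post, wrapped in a table row so the
// HTML parser sees them in a valid context (the wrapper is an assumption).
$html = '<table><tr>'
      . '<td class="l"><span title="Word_1">Word_1</span></td>'
      . '<td class="l"><span title=""></span></td>'
      . '<td class="l">Word_3</td>'
      . '<td class="l" style="color: green;">Word_4</td>'
      . '</tr></table>';

// Collect parser complaints internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$words = [];
// One node per matching cell; textContent flattens any nested tags.
foreach ($xpath->query('//td[@class="l"]') as $td) {
    $words[] = trim($td->textContent);
}

print_r($words);
```

Every td produces exactly one entry, so an empty cell comes back as an empty string, and extra levels of nesting inside a cell do not change the count.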

#9

On Thu, 09 Nov 2017 21:06:43 +0000, Andy Burns wrote:

> Bob Eager wrote:
>
>> I once had to do a similar task, and regex really isn't the right
>> answer.
>
> If I had a choice I'd use something that could read the HTML document and
> access it through the DOM, perhaps with XPath, and PHP is almost never my
> weapon of choice ...


I used this:

http://www.ml1.org.uk

A program over 50 years old....!

--
My posts are my copyright and if @diy_forums or Home Owners' Hub
wish to copy them they can pay me £1 a message.
Use the BIG mirror service in the UK: http://www.mirrorservice.org
*lightning surge protection* - a w_tom conductor

#10

On Thu, 9 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:

> Find what is between <td> and </td> first.
>
> Then eliminate anything between < and >
>
> What's left, if anything, will be the wanted words


Trouble is if there is nothing left (empty field) it doesn't return
anything for that position.

--
Cheers
Dave.





#11

On 09/11/17 23:37, Dave Liquorice wrote:
> On Thu, 9 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:
>
>> Find what is between <td> and </td> first.
>>
>> Then eliminate anything between < and >
>>
>> What's left, if anything, will be the wanted words
>
> Trouble is if there is nothing left (empty field) it doesn't return
> anything for that position.

PHP will return a null string.


--
If you tell a lie big enough and keep repeating it, people will
eventually come to believe it. The lie can be maintained only for such
time as the State can shield the people from the political, economic
and/or military consequences of the lie. It thus becomes vitally
important for the State to use all of its powers to repress dissent, for
the truth is the mortal enemy of the lie, and thus by extension, the
truth is the greatest enemy of the State.

Joseph Goebbels



#12

On 09/11/17 21:06, Andy Burns wrote:
> Bob Eager wrote:
>
>> I once had to do a similar task, and regex really isn't the right answer.
>
> If I had a choice I'd use something that could read the HTML document and
> access it through the DOM, perhaps with XPath, and PHP is almost never my
> weapon of choice ...


And for that, I've used CSS selectors with Pup.

https://github.com/EricChiang/pup

Which works a bit like jq on json.

--
Adrian C

#13

I don't know PHP, but I'd use an XML parser for this. Regexes seem
inappropriate for this kind of task.

--
If a man stands in a forest and no woman is around to hear him, is he still wrong?

#14

On Fri, 10 Nov 2017 06:15:41 +0000, The Natural Philosopher wrote:

>>> Find what is between <td> and </td> first.
>>>
>>> Then eliminate anything between < and >
>>>
>>> What's left, if anything, will be the wanted words
>>
>> Trouble is if there is nothing left (empty field) it doesn't return
>> anything for that position.
>
> PHP will return a null string.


It didn't the ways I tried.

And Andy, yes, I did see your posts and the solution that does work, thank
you. Now I need to a) work out what it's doing, I think it's the
(?:...)? construct, and b) investigate the site you used.

As for PHP not being the best tool, maybe, but it's part of an
existing PHP page that works apart from that one little niggle.

--
Cheers
Dave.



#15

Dave Liquorice wrote:

> Andy yes, I did see your posts and solution that does work, thank
> you. Now need to a) work out what it's doing,


that's the thing with regex ... don't ask me in 6 months what it's
doing, especially the 3rd version!

Rather than using .* to match the remainder of html tags I used [^>]*
which is less greedy, to match everything up to, but not including, the
closing chevron, to make sure it matches just a single tag at a time.

> I think it's the (?:...)? construct.


That's a non-capturing group since you're not really interested in the
second level of html tags wrapping the inner text, other than to notice
and skip them, that way you don't need to worry about the "Nth" match
varying depending if the span tags exist or not,

so it matches ...

<opening td><optional span>CAPTURED-TEXT<optional close span><close td>

of course it could get confused if the site you're scraping from
suddenly uses a third level of tags inside the span for some rows.
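
To make the explanation above concrete, here is a hedged sketch of the same pattern written with PCRE's x modifier so each piece can carry a comment. The test row, including the cell with a third level of tags, is invented for illustration and is not taken from the real site.

```php
<?php
// Same pattern as before, spread out with the 'x' modifier so that
// whitespace is ignored and '#' starts a comment inside the pattern.
$pattern = '~
    <td[^>]*>          # opening td, everything up to its closing chevron
    (?:<span[^>]*>)?   # optional opening span, matched but not captured
    ([^<]*)            # group 1: the text, which may be empty
    (?:</span>)?       # optional closing span, matched but not captured
    </td>              # closing td
~x';

// Hypothetical row: plain span, empty span, and a span wrapping a <b>.
$row = '<td class="l"><span title="a">a</span></td>'
     . '<td class="l"><span title=""></span></td>'
     . '<td class="l"><span title="c"><b>c</b></span></td>';

preg_match_all($pattern, $row, $m);
print_r($m[1]); // only two entries: the third cell does not match at all

// Greedy '.*' runs on to the last '>' in the string,
// while '[^>]*' stops at the end of the first tag.
preg_match('~<td.*>~', $row, $greedy);
preg_match('~<td[^>]*>~', $row, $single);
echo strlen($greedy[0]), ' vs ', strlen($single[0]), "\n";
```

The dropped third cell shows the failure mode mentioned above: a deeper level of nesting makes that cell vanish from the results without any error.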

> b) investigate the site you used.

I've found it handy several times.

> As for PHP not being the best tool, maybe, but it's part of an
> existing PHP page that works apart from that one little niggle.


I guessed as much ...


#16

Adrian Caspersz wrote:

> https://github.com/EricChiang/pup
> Which works a bit like jq on json.


I have been trying to avoid picking sides between Go and Rust ...