UK diy (uk.d-i-y) For the discussion of all topics related to DIY (do-it-yourself) in the UK. All levels of experience and proficiency are welcome to join in to ask questions or offer solutions.
#1
Posted to uk.d-i-y
regex guru required
Given the string:

<td class="l"><span title="Word_1">Word_1</span></td><td class="l"><span title=""></span></td><td class="l">Word_3</td><td class="l" style="color: green;">Word_4</td>

What regex magicary for PHP's preg_match_all can extract just the text of the "Word_n" fields, *including* the empty Word_2? That is, I want a list or four variables filled with Word_1, Word_2, Word_3 and Word_4 even when a field is empty. The "words" change and so does the color. The actual string is longer but all subsequent fields follow the same format as Word_3.

Just dumping everything between < and >, or collecting everything between > and <, doesn't work as there are effectively empty matches between adjacent tags. So you end up with:

$1 = ""
$2 = "Word_1"
$3 = ""
$4 = ""
$5 = ""
$6 = "" (This would be "Word_2" if it wasn't empty)
$7 = ""
$8 = "Word_3"
$9 = ""
$10 = "Word_4"

Rather than:

$1 = "Word_1"
$2 = "" (This would be "Word_2" if it wasn't empty)
$3 = "Word_3"
$4 = "Word_4"

Reliably finding the end of each word is easy with:

>(.*?)<\/[s|t]

Finding the beginning is what I'm stuck on.

\">(.*?)<\/[s|t] fails as it leaves the span title tag in place.

\">([^<].*?)<\/[s|t] fails as it strips the empty Word_2.

-- 
Cheers
Dave.
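The two failing attempts can be reproduced in a few lines. This is only a sketch, written in Python rather than PHP so it can be run standalone. It assumes the angle brackets that the archive stripped from the sample string and the patterns are restored as `<` and `>`, and it drops the `\"` and `\/` escapes, which Python does not need. The names `row`, `attempt_1` and `attempt_2` are made up for illustration.

```python
import re

# Sample row from the post, with the angle brackets restored (an assumption).
row = ('<td class="l"><span title="Word_1">Word_1</span></td>'
       '<td class="l"><span title=""></span></td>'
       '<td class="l">Word_3</td>'
       '<td class="l" style="color: green;">Word_4</td>')

# Attempt 1: start at '">' and capture lazily up to "</s" or "</t".
attempt_1 = re.findall(r'">(.*?)</[s|t]', row)

# Attempt 2: same, but the first captured character must not be "<".
attempt_2 = re.findall(r'">([^<].*?)</[s|t]', row)

print(attempt_1)  # ['<span title="Word_1">Word_1', '<span title="">', 'Word_3', 'Word_4']
print(attempt_2)  # ['Word_1', 'Word_3', 'Word_4']
```

The first list still carries the span tags, and the second has only three entries because the empty field never satisfies `[^<]`.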
#2
Dave Liquorice wrote:
> Given the string:
> <td class="l"><span title="Word_1">Word_1</span></td><td class="l"><span title=""></span></td><td class="l">Word_3</td><td class="l" style="color: green;">Word_4</td>
> What regex magicary for PHP's preg_match_all that can extract just the text of the "Word_n" fields *including* the empty Word_2.

No immediate answer, but I find using the online "regex tinkering tools" helps, e.g.

https://regex101.com/r/CGUbln/1
#3
Andy Burns wrote:
> Dave Liquorice wrote:
>> What regex magicary for PHP's preg_match_all that can extract just the text of the "Word_n" fields *including* the empty Word_2.
> I find using the online "regex tinkering tools" helps

Try this ...

https://regex101.com/r/CGUbln/2
#4
Andy Burns wrote:
> https://regex101.com/r/CGUbln/2

Yes, that seems to work in the general case, so the regex you want is

<td[^>]*>(?:<span[^>]*>)?([^<]*)(?:<\/span>)?<\/td>
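As a quick check of that pattern, here is a minimal sketch in Python (chosen only so it runs standalone; the pattern itself is not Python-specific). It assumes the angle brackets stripped by the archive are `<` and `>` as reconstructed, and `row`, `cell` and `words` are illustrative names.

```python
import re

# Sample row from the original question, angle brackets restored (an assumption).
row = ('<td class="l"><span title="Word_1">Word_1</span></td>'
       '<td class="l"><span title=""></span></td>'
       '<td class="l">Word_3</td>'
       '<td class="l" style="color: green;">Word_4</td>')

# Opening td, optional span, captured text, optional closing span, closing td.
cell = re.compile(r'<td[^>]*>(?:<span[^>]*>)?([^<]*)(?:</span>)?</td>')

# With a single capturing group, findall returns one string per cell.
words = cell.findall(row)
print(words)  # ['Word_1', '', 'Word_3', 'Word_4']
```

In PHP the same pattern, wrapped in delimiters, would go to preg_match_all, where the captured words should end up in the second element of the matches array under the default ordering.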
#5
On 09/11/17 17:55, Dave Liquorice wrote:
> Given the string: [...]
> What regex magicary for PHP's preg_match_all that can extract just the text of the "Word_n" fields *including* the empty Word_2.
> [...]

Do it in stages. Find what is between <td></td> first. Then eliminate anything between < and >. What's left, if anything, will be the wanted words.

-- 
Canada is all right really, though not for the whole weekend.
"Saki"
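The staged approach could look something like the sketch below. It is written in Python rather than PHP so it runs standalone, it assumes the sample string with its angle brackets restored, and the names `row`, `cells` and `words` are hypothetical.

```python
import re

# Sample row from the original question, angle brackets restored (an assumption).
row = ('<td class="l"><span title="Word_1">Word_1</span></td>'
       '<td class="l"><span title=""></span></td>'
       '<td class="l">Word_3</td>'
       '<td class="l" style="color: green;">Word_4</td>')

# Stage 1: everything between each opening <td ...> and the next </td>.
cells = re.findall(r'<td[^>]*>(.*?)</td>', row)

# Stage 2: delete anything between "<" and ">" inside each cell.
words = [re.sub(r'<[^>]*>', '', cell) for cell in cells]

print(cells)
print(words)  # ['Word_1', '', 'Word_3', 'Word_4']
```

Because stage 1 yields one entry per cell before any stripping happens, the empty field keeps its position as an empty string.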
#6
Andy Burns wrote:
> the regex you want is

If you want it more general, so that it captures the inner text from within either a single or a double-nested set of HTML elements, regardless of what the element types are ...

https://regex101.com/r/CGUbln/3
#7
On Thu, 09 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:
> On 09/11/17 17:55, Dave Liquorice wrote:
>> Given the string: [...]
>> What regex magicary for PHP's preg_match_all that can extract just the text of the "Word_n" fields *including* the empty Word_2.
>> [...]
> do it in stages. Find what is between <td></td> first. Then eliminate anything between < and >. Whats left, if anything, will be the wanted words

I once had to do a similar task, and regex really isn't the right answer. Better to use a program that just strips out what isn't wanted.

(There was a guy, maybe in this group, I can't remember, who had about 400 web pages on WWI bombing missions and just wanted to extract the names of the crew; the pages had been written by different people and weren't consistent.)

-- 
My posts are my copyright and if @diy_forums or Home Owners' Hub wish to copy them they can pay me £1 a message.
Use the BIG mirror service in the UK: http://www.mirrorservice.org
*lightning surge protection* - a w_tom conductor
#8
Bob Eager wrote:
> I once had to do a similar task, and regex really isn't the right answer.

If I had a choice, I'd use something that could read the HTML document and access it through the DOM, perhaps with XPath, and PHP is almost never my weapon of choice ...
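For comparison, a parser-based version might look like the sketch below. It uses Python's standard xml.etree.ElementTree, which supports only a limited XPath subset, and it leans on a big assumption: that the fragment is well-formed XML once wrapped in a root element, which real-world HTML often is not. The wrapping `<tr>` element and the names `row`, `root` and `words` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample row from the original question, angle brackets restored (an assumption).
row = ('<td class="l"><span title="Word_1">Word_1</span></td>'
       '<td class="l"><span title=""></span></td>'
       '<td class="l">Word_3</td>'
       '<td class="l" style="color: green;">Word_4</td>')

# Wrap the cells in a single root element so the fragment parses as XML.
root = ET.fromstring('<tr>' + row + '</tr>')

# ".//td" selects every td below the root; itertext() gathers the text of
# the cell and of anything nested inside it, however deep.
words = [''.join(td.itertext()) for td in root.findall('.//td')]

print(words)  # ['Word_1', '', 'Word_3', 'Word_4']
```

Unlike the regex, this does not care how many levels of tags wrap the text, but it will raise a parse error on markup that is not well-formed.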
#9
On Thu, 09 Nov 2017 21:06:43 +0000, Andy Burns wrote:
> Bob Eager wrote:
>> I once had to do a similar task, and regex really isn't the right answer.
> If I had a choice something that could read the html document and access it through the DOM model, perhaps with XPath, and PHP is almost never my weapon of choice ...

I used this:

http://www.ml1.org.uk

A program over 50 years old....!
#10
On Thu, 9 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:
> Find what is between <td></td> first. Then eliminate anything between < and >. Whats left, if anything, will be the wanted words

Trouble is, if there is nothing left (an empty field), it doesn't return anything for that position.

-- 
Cheers
Dave.
#11
On 09/11/17 23:37, Dave Liquorice wrote:
> On Thu, 9 Nov 2017 19:30:39 +0000, The Natural Philosopher wrote:
>> Find what is between <td></td> first. Then eliminate anything between < and >. Whats left, if anything, will be the wanted words
> Trouble is if there is nothing left (empty field) it doesn't return anything for that position.

PHP will return a null string.

-- 
If you tell a lie big enough and keep repeating it, people will eventually come to believe it. The lie can be maintained only for such time as the State can shield the people from the political, economic and/or military consequences of the lie. It thus becomes vitally important for the State to use all of its powers to repress dissent, for the truth is the mortal enemy of the lie, and thus by extension, the truth is the greatest enemy of the State.
Joseph Goebbels
#12
On 09/11/17 21:06, Andy Burns wrote:
> Bob Eager wrote:
>> I once had to do a similar task, and regex really isn't the right answer.
> If I had a choice something that could read the html document and access it through the DOM model, perhaps with XPath, and PHP is almost never my weapon of choice ...

And for that, I've used CSS selectors with Pup.

https://github.com/EricChiang/pup

It works a bit like jq does on JSON.

-- 
Adrian C
#13
On Thu, 09 Nov 2017 17:55:20 +0000 (GMT), "Dave Liquorice" wrote:
> Given the string: [...]
> What regex magicary for PHP's preg_match_all that can extract just the text of the "Word_n" fields *including* the empty Word_2.
> [...]

I don't know PHP, but I'd use an XML parser for this. Regexes seem inappropriate for this kind of task.

-- 
If a man stands in a forest and no woman is around to hear him, is he still wrong?
#14
On Fri, 10 Nov 2017 06:15:41 +0000, The Natural Philosopher wrote:
>>> Find what is between <td></td> first. Then eliminate anything between < and >. Whats left, if anything, will be the wanted words
>> Trouble is if there is nothing left (empty field) it doesn't return anything for that position.
> PHP will return a null string.

It didn't, the ways I tried.

And Andy, yes, I did see your posts and the solution that does work, thank you. Now I need to:

a) work out what it's doing, I think it's the (?:...)? construct.
b) investigate the site you used.

As for PHP not being the best tool, maybe, but it's part of an existing PHP page that works apart from that one little niggle.

-- 
Cheers
Dave.
#15
Dave Liquorice wrote:
> Andy yes, I did see your posts and solution that does work, thank you. Now need to a) work out what it's doing,

That's the thing with regex ... don't ask me in 6 months what it's doing, especially the 3rd version!

Rather than using .* to match the remainder of HTML tags I used [^>]*, which is less greedy: it matches everything up to, but not including, the closing chevron, to make sure it matches just a single tag at a time.

> I think it's the (?:...)? construct.

That's a non-capturing group, since you're not really interested in the second level of HTML tags wrapping the inner text, other than to notice and skip them. That way you don't need to worry about the "Nth" match varying depending on whether the span tags exist or not, so it matches ...

<opening td><optional span>CAPTURED-TEXT<optional close span><close td>

Of course it could get confused if the site you're scraping from suddenly uses a third level of tags inside the span for some rows.

> b) investigate the site you used.

I've found it handy several times.

> As for PHP not being the best tool, maybe, but it's part of an existing PHP page that works apart from that one little niggle.

I guessed as much ...
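Both points can be seen in a small experiment. This is a sketch in Python (so it runs standalone), assuming the sample string with its angle brackets restored; all variable names are illustrative. Strictly speaking `[^>]*` is still a greedy quantifier; what stops it is that the character class cannot match `>`.

```python
import re

# Sample row from the original question, angle brackets restored (an assumption).
row = ('<td class="l"><span title="Word_1">Word_1</span></td>'
       '<td class="l"><span title=""></span></td>'
       '<td class="l">Word_3</td>'
       '<td class="l" style="color: green;">Word_4</td>')

# ".*" runs on to the last ">" in the string; "[^>]*" stops at the first.
greedy = re.match(r'<td.*>', row).group(0)
bounded = re.match(r'<td[^>]*>', row).group(0)
print(greedy == row)  # True
print(bounded)        # <td class="l">

# (?:...) groups without capturing, so the word is the only numbered group.
non_capturing = re.compile(r'<td[^>]*>(?:<span[^>]*>)?([^<]*)(?:</span>)?</td>')
capturing = re.compile(r'<td[^>]*>(<span[^>]*>)?([^<]*)(</span>)?</td>')
print(non_capturing.groups, capturing.groups)  # 1 3
```

With the capturing variant the span tags would also be returned as groups of their own, which is the extra bookkeeping the non-capturing form avoids.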
#16
Adrian Caspersz wrote:
> https://github.com/EricChiang/pup
> Which works a bit like jq on json.

I have been trying to avoid picking sides between Go and Rust ...